
Intro to data 
analysis with R 


Unlocking the power of R for 
public health data science 


This book is a compilation of lesson notes from the 3-month online course offered by The GRAPH
Courses. To access the lesson videos, exercise Rmds, and online quizzes, please visit our website,
thegraphcourses.org.


Lesson notes | Setting up R and RStudio 


Created by the GRAPH Courses team 


January 2023 


This document serves as an accompaniment for a lesson found on https://thegraphcourses.org.


The GRAPH Courses is a project of the Global Research and Analyses for Public Health 
(GRAPH) Network, a non-profit headquartered at the University of Geneva Global Health 
Institute, and supported by the World Health Organization (WHO) and other partners 


Introduction
Working locally vs. on the cloud
RStudio on the cloud
Set up on Windows
  Download and install R
  Download, install & run RStudio
Set up on macOS
  Download and install R
  Download, install & run RStudio
Wrap up


Learning objective 


1. You can access R and RStudio, either through RStudio.cloud or by downloading and
installing the software on your computer.


Introduction 


To start you off on your R journey, we'll need to set you up with the required software: R
and RStudio. R is the programming language that you'll use to write code, while RStudio is
an integrated development environment (IDE) that makes working with R easier.


Working locally vs. on the cloud 


There are two main ways that you can access and work with R and RStudio: download 
them to your computer, or use a web server to access them on the cloud. 


Using R and RStudio on the cloud is the less common option, but it may be the right 
choice if you are just getting started with programming, and you do not yet want to 
worry about installing software. You may also prefer the cloud option if your local 
computer is old, slow, or otherwise unfit for running R. 


Below, we go through the setup process for RStudio Cloud, RStudio on Windows and
RStudio on macOS separately. Jump to the section that is relevant for you!


WATCH OUT


If you plan to use RStudio on the cloud for more than 25 hours per month, you may want
to avoid this option.


RStudio on the cloud 


If you'll be working on the cloud, follow the steps below:


1. Go to the website rstudio.cloud and follow the instructions to sign up for a free
account. (We recommend signing up with Google if you have a Google account, so
you don't need to remember any new passwords.)


2. Once you're done, click on the “New Project” icon at the top right, and select “New
RStudio Project”.


[Screenshot: the RStudio Cloud workspace, with the “New Project” menu open and the “New RStudio Project” option highlighted]


You should see a screen like this: 


[Screenshot: a new RStudio Cloud project. The Console pane on the left shows the R 4.2.0 startup message; the Environment and Files panes are on the right.]

This is RStudio, your new home for a long time to come! 


At the top of the screen, rename the project from “Untitled Project” to something like
“r_intro”.


[Screenshot: the project name field at the top of the screen, labelled “Click to name your project”]


You can start using R by typing code into the “console” pane on the left: 


[Screenshot: the Console pane showing the R startup message, with an arrow at the prompt labelled “Write code here”]


Try using R as a calculator here; type 2 + 2 and press Enter. 
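

As a minimal sketch of what you might type at the console prompt, the lines below use R as a calculator. The particular calculations, and the name `total`, are just illustrative choices; any arithmetic works.

```r
# Basic arithmetic typed at the console prompt
2 + 2        # addition
10 - 3       # subtraction
4 * 5        # multiplication
20 / 8       # division
2^3          # exponentiation

# A result can also be stored under a name and reused.
# `total` is a hypothetical name chosen for this example.
total <- 2 + 2
total * 10
```

Press Enter after each line to see its result printed in the console.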


That's it; you're ready to roll. Whenever you want to reopen RStudio, navigate to
rstudio.cloud.


Proceed to the “wrapping up” section of the lesson. 


Set up on Windows 


Download and install R 
If you're working on Windows, follow the steps below to download and install R: 


1. Go to cran.rstudio.com to access the R installation page. Then click the download 
link for Windows: 


[Screenshot: the CRAN home page, “The Comprehensive R Archive Network”, with the links “Download R for Linux”, “Download R for macOS” and “Download R for Windows”]


2. Choose the “base” sub-directory. 


[Screenshot: the “R for Windows” page listing the subdirectories; “base” holds the binaries for the base distribution and is what you want when installing R for the first time]


3. Then click on the download link at the top of the page to download the latest 
version of R: 


[Screenshot: the “R-4.2.0 for Windows” page, with the link “Download R-4.2.0 for Windows (79 megabytes, 64 bit)” at the top]


Note that the screenshot above may not show the latest version. 


4. After the download is finished, click on the downloaded file, then follow the 
instructions on the installation pop-up window. During installation, you should not 
have to change any of the defaults; just keep clicking “Next” until the installation is 
done. 


Well done! You should now have R on your computer. But you likely won't ever need
to interact with R directly. Instead you'll use the RStudio IDE to work with R. Follow
the instructions in the next section to get RStudio.


Download, install & run RStudio 


To download RStudio, go to rstudio.com/products/rstudio/download/#download and 
download the Windows version. 


[Screenshot: the RStudio download page, with the button “Download RStudio for Windows”]


After the download is finished, click on the downloaded file and follow the installation 
instructions. 


Once installed, RStudio can be opened like any application on your computer: press the
Windows key to bring up the Start menu, and search for “rstudio”. Click to open the
app:


[Screenshot: the Windows Start menu search for “rstudio”, with the RStudio app shown as the best match]


You should see a window like this: 


[Screenshot: the RStudio window on Windows. The Console pane on the left shows the R 4.1.2 startup message; the Environment and Help panes are on the right.]


This is RStudio, your new home for a long time to come! 


You can start using R by typing code into the “console” pane on the left: 


[Screenshot: the Console pane showing the R startup message, with an arrow at the prompt labelled “Write R code here!”]


Try using R as a calculator here; type 2 + 2 and press Enter. 
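

Beyond arithmetic, the console is also a convenient place to confirm which version of R your RStudio is using. The sketch below relies on `R.version.string` and `sessionInfo()`, both part of base R; the exact text printed depends on the version you installed.

```r
# Print the version of R that this session is running
print(R.version.string)

# Show more detail about the session (platform, attached packages)
sessionInfo()
```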


That's it; you're ready to roll. Proceed to the “wrapping up” section of the lesson. 


Set up on macOS 


Download and install R 
If you're working on macOS, follow the steps below to download and install R: 


1. Go to cran.rstudio.com to access the R installation page. Then click the link for 
macOS: 


[Screenshot: the CRAN home page, “The Comprehensive R Archive Network”, with the download links for each operating system]


2. Download and install the relevant R version for your Mac. For most people, the first 
option under “Latest release” will be the one to get. 


[Screenshot: the CRAN “R for macOS” page. Under “Latest release”, R-4.2.0.pkg is the build for Intel Macs and R-4.2.0-arm64.pkg is the build for Apple silicon (M1) Macs only; an R 3.6.3 binary is listed further down for older Macs running legacy OS X systems.]


3. After the download is finished, click on the downloaded file, then follow the 
instructions on the installation pop-up window. 


Well done! You should now have R on your computer. But you likely won't ever need to 
interact with R directly. Instead you'll use the RStudio IDE to work with R. Follow the 
instructions in the next section to get RStudio. 


Download, install & run RStudio 


To download RStudio, go to rstudio.com/products/rstudio/download/#download and 
download the version for macOS. 


[Screenshot: the RStudio download page, with the button “Download RStudio for Mac”]


After the download is finished, click on the downloaded file and follow the installation
instructions.


Once installed, RStudio can be opened like any application on your computer: Press 
Command + Space to open Spotlight, then search for “rstudio”. Click to open the app. 


You should see a window like this:


[Screenshot: the RStudio window on macOS, with an untitled script and the Console on the left, and the Environment and Files panes on the right]


This is RStudio, your new home for a long time to come! 


You can start using R by typing code into the “console” pane on the left: 


[Screenshot: the Console pane showing the R startup message, with an arrow at the prompt labelled “Write code here”]


Try using R as a calculator here; type 2 + 2 and press Enter. 


Wrap up 


You should now have access to R and RStudio, so you're all set to begin the journey of 
learning to use these immensely powerful tools. See you in the next session! 


Contributors 


The following team members contributed to this lesson: 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


LAMECK AGASA 


Statistician/Data Scientist 


MICHAL SHRESTHA 


Global Health Researcher, the GRAPH Network 
An advocate of health equity & justice through equal access to health data 


ELTON MUKONDA 


Data analyst, the GRAPH Network 
A data enthusiast with a passion for population health research 


OLIVIA KEISER 


Head of division of Infectious Diseases and Mathematical Modelling, 
University of Geneva 


References 


Some material in this lesson was adapted from the following sources: 


• Nordmann, Emily, and Heather Cleland-Woods. Chapter 2 Programming Basics | Data
Skills. psyteachr.github.io, https://psyteachr.github.io/data-skills-v1/programming-basics.html.
Accessed 23 Feb. 2022.


This work is licensed under the Creative Commons Attribution Share Alike license. 


Lesson notes | Using RStudio 




Learning objectives
Introduction
The RStudio panes
  Source/Editor
  Console
  Environment
  History
  Files
  Plots
  [...]
Command palette
Wrapping up
Further resources
[...]


Learning objectives 


1. You can identify and use the following tabs in RStudio: Source, Console, 
Environment, History, Files, Plots, Packages, Help and Viewer. 


2. You can modify RStudio’s interface options to suit your needs. 


Introduction 


Now that you have access to R & RStudio, let’s go on a quick tour of the RStudio interface, 
your digital home for a long time to come. 


We will cover a lot of territory quickly. Do not panic. You are not expected to remember
all of this. Rather, you will see these topics again and again throughout the course, and you
will naturally assimilate them that way.


You can also refer back to this lesson as you progress. 


The goal here is simply to make you aware of the tools at your disposal within RStudio. 


To get started, you need to open the RStudio application: 


• If you are working with RStudio Cloud, go to rstudio.cloud, log in, then click on the
“r_intro” project that you created in the last lesson. (If you do not see this, simply
create a new R project using the “New Project” icon at the top right.)


• If you are working on your local computer, go to your applications folder and double
click on the RStudio icon. Or you can search for this application from your Start Menu
(Windows), or through Spotlight (Mac).


The RStudio panes 


By default, RStudio is arranged into four window panes. 


If you only see three panes, open a new script with File > New File > R Script. This 
should reveal one more pane. 


[Screenshot: the File menu, with New File > R Script selected]


Before we go any further, we will rearrange these panes to improve the usability of the 
interface. 


To do this, in the RStudio menu at the top of the screen, select Tools > Global 
Options to bring up RStudio’s options. Then under Pane Layout, adjust the pane 
arrangement. The arrangement we recommend is shown below. 


[Screenshot: the Pane Layout options. Top left: Source. Top right: Console. Bottom left: TabSet, with no tabs checked. Bottom right: Environment, History, Files, Plots, Packages, Help and Viewer checked.]


At the top left pane is the Source tab, and at the top right pane, you should have the
Console tab.


Then at the bottom left pane, no tab options should be checked; this section should be left
empty, with the drop-down saying just “TabSet”.


Finally, at the bottom right pane, you should check the following tabs: Environment,
History, Files, Plots, Packages, Help and Viewer.


Great, now you should have an RStudio window that looks something like this: 


[Screenshot: the rearranged RStudio window, with an untitled script at the top left, the Console at the top right, and the Environment, History, Files, Plots, Packages and Help tabs at the bottom right]


The top-left pane is where you will do most of the coding. Make this larger by clicking on 
its maximize icon: 


[Screenshot: the maximize icon at the top right corner of the Source pane]


Note that you can drag the bar that separates the window panes to resize them. 


[Screenshot: the bar between two panes, labelled “Drag to resize”]


Now let’s look at each of the RStudio tabs one by one. Below is a summary image of what 
we will discuss: 


[Summary image of the RStudio tabs:
Editor: type and save code as scripts.
Console: run code.
Environment: view datasets and other objects in the workspace.
History: view and search through previous commands.
Files: interact with the computer's file system.
Plots: view and export plots created with R.
Packages: install and load packages.
Help: consult R documentation.]


Source/Editor


The source or editor is where your R “scripts” go. A script is a text document where you
write and save code.


Because this is where you will do most of your coding, it is important that you have a lot
of visual space. That is why we rearranged the RStudio pane layout above: to give the
Editor more space.


Now let’s see how to use this Editor. 


First, open a new script under the File menu if one is not yet open: File > New File > 
R Script. In the script, type the following: 


print("excited for R!")


To run code, place your cursor anywhere in the code, then hit Command + Enter on 
macOS, or Control + Enter on Windows. 


This should send the code to the Console and run it. 


You can also run multiple lines at once. To try this, add a second line to your script, so 
that it now reads: 


print("excited for R!")
print("and RStudio")


Now drag your cursor to highlight both lines and press Command/Control + Enter. 


To run the entire script, you can use Command/Control + A to select all code, then press
Command/Control + Enter. Try this now. Deselect your code, then try the shortcut to
select all.
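

For reference, here is a sketch of the two-line script together with the output you would expect to see in the Console after running both lines. The `#>` comments are not part of the script; they only mark the expected console output.

```r
# Contents of the script; select both lines and run them together
print("excited for R!")
#> [1] "excited for R!"
print("and RStudio")
#> [1] "and RStudio"
```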


SIDENOTE There is also a ‘Run’ button at the top right of the source panel, with which
you can run code (either the current line, or all highlighted code). But you should try
to use the keyboard shortcut instead.


To open the script in a new window, click on the third icon in the toolbar directly above 
the script. 


[Screenshot: the toolbar above the script, with the “show in new window” icon highlighted]

To put the window back, click on the same button on the now-external window. 


Next, save the script. Hit Command/Control + S to bring up the Save dialog box. Give it a 
file name like “rstudio_intro”. 


• If you are working with RStudio Cloud, the file will be saved in your project folder.


• If you are working on your local computer, save the file in an easy-to-locate part of
your computer, perhaps your desktop. (Later on we will think about the “proper”
way to organize and store scripts.)


You can view data frames (which are like spreadsheets in R) in the same pane. To 
observe this, type and run the code below on a new line in your script: 


View(women)


Notice the uppercase “V” in View().


[Screenshot: the data viewer showing the women dataset, with the columns height and weight; the first rows are 58 and 115, 59 and 117, 60 and 120, 61 and 123, 62 and 126]


women is the name of a dataset that comes loaded with R. It gives the average heights and 
weights for American women aged 30-39. 
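

If you prefer to inspect the dataset from the console rather than the data viewer, a few base R functions give a quick summary. This is only a sketch of some common options, not the one required way to look at data.

```r
# First six rows of the built-in women dataset
head(women)

# Number of rows and columns
dim(women)
#> [1] 15  2

# Column names
names(women)
#> [1] "height" "weight"

# Compact overview of the structure of the data frame
str(women)
```

Unlike View(), these functions print their results in the Console and do not open a new tab.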


You can click on the “x” icon to the right of the “women” tab to close this data viewer. 


Console 


The console, at the bottom left, is where code is executed. You can type code directly 
here, but it will not be saved. 


Type a random piece of code (maybe a calculation like 3 + 3) and press ‘Enter’. 


[Screenshot: the Console pane, with calculations such as 1 + 1, 2 + 2 and 3 + 3 typed at the prompt and their results printed below each one]


If you place your cursor on the last line of the console, and you press the up arrow, you 
can go back to the last code that was run. Keep pressing it to cycle to the previous lines. 


To run any of these previous lines, press Enter. 


Environment 


[Screenshot: the Environment tab, with its toolbar and an empty Global Environment]


At the top right of the RStudio Window, you should see the Environment tab. 


The Environment tab shows datasets and other objects that are loaded into R’s working 
memory, or “workspace”. 


To explore this tab, let’s import a dataset into your environment from the web. Type the 
code below into your script and run it: 


ebola_data <- read.csv("https://tinyurl.com/ebola-data-sample")


SIDENOTE You don’t need to understand exactly what the code above is doing for
now. We just want to quickly show you the basic features of the
Environment pane; we'll look at data importing in detail later.


Also, if you do not have active internet access, the code above will not
run. You can skip this section and move to the “History” tab.


You have now imported the dataset and stored it in an object named ebola_data. (You
could have named the object anything you want.)


Now that the dataset is stored by R, you should be able to see it in the Environment pane.
Click on the blue drop-down icon beside the object’s name in the Environment tab
to reveal a summary.


[Screenshot: the Environment tab showing ebola_data, 200 obs. of 7 variables: id (int), age (num), sex (chr), status (chr), date_of_onset (chr), date_of_sample (chr) and district (chr)]


Try clicking directly on the ebola_data dataset from the Environment tab. This opens it
in a ‘View’ tab.


You can remove an object from the workspace with the rm() function. Type and run the 
following in a new line on your R script. 


rm(ebola_data)


Notice that the ebola_data object no longer shows up in your environment after having
run that code. 
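The same check can be done from code rather than by looking at the Environment pane. Below is a minimal sketch, using a throwaway object name (demo_obj) that is our own and not part of the lesson. It assumes you run it in a fresh session where no other object has that name.

```r
# Create a small object, confirm it exists, then remove it
demo_obj <- c(1, 2, 3)     # hypothetical object, for illustration only
exists("demo_obj")         # TRUE: the object is in the workspace
"demo_obj" %in% ls()       # TRUE: ls() lists the names of stored objects

rm(demo_obj)               # remove it, just as rm(ebola_data) did above
exists("demo_obj")         # FALSE: the object is gone
```

Here exists() and ls() are base R functions; they report what the Environment pane shows visually.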


The broom icon at the top of the Environment pane can also be used to clear your
workspace.


To practice using it, try re-running the line above that imports the Ebola dataset, then 
clear the object using the broom icon. 


History 


Next, the History tab shows previous commands you have run. 


[Screenshot: the History tab, listing previously run commands such as 2+2,
ebola_data <- read.csv("https://tinyurl.com/ebola-data-sample") and View(ebola_data)]


You can click a line to highlight it, then send it to the console or to your script with the 
“To Console” and “To Source” icons at the top of this tab. 


To select multiple lines, use the “Shift-click” method: click the first item you want to 
select, then hold down the “Shift” key and click the last item you want to select. 


Finally, notice that there is a search bar at the top right of the History pane where you 
can search for past commands that you have run. 


Files 


Next, the Files tab. This shows the files and folders in the folder you are working in. 


[Screenshot: the Files tab, showing the file rstudio_intro.R in the folder
project > chapter_01_getting_started > scripts]


The tab allows you to interact with your computer's file system. 


Try playing with some of the buttons here, to see what they do. You should try at least 
the following: 


• Make a new folder
• Delete that folder
• Make a new R Script
• Rename that script
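The buttons in the Files tab have code equivalents in base R. The sketch below walks through the same four actions; the folder and file names are hypothetical, and everything happens inside R's temporary directory so that nothing on your computer is touched.

```r
# Work inside a temporary folder (hypothetical names, for illustration only)
practice_dir <- file.path(tempdir(), "practice_folder")

dir.create(practice_dir)                        # make a new folder
script_path <- file.path(practice_dir, "my_script.R")
file.create(script_path)                        # make a new (empty) R script

new_path <- file.path(practice_dir, "renamed_script.R")
file.rename(script_path, new_path)              # rename that script
list.files(practice_dir)                        # lists the files in the folder

unlink(practice_dir, recursive = TRUE)          # delete the folder and its contents
dir.exists(practice_dir)                        # FALSE once the folder is deleted
```

For everyday work the buttons are perfectly fine; the code version is mainly useful when you want a script to set up its own folders.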


Plots 


Next, the Plots tab. This is where figures that are generated by R will show up. Try 
creating a simple plot with the following code: 


plot(women)


[Figure: the Plots tab showing a scatter plot of weight against height from the women dataset]


That code creates a plot of the two variables in the women dataset. You should see this 
figure in the Plots tab. 


Now, test out the buttons at the top of this tab to explore what they do. In particular, try 
to export a plot to your computer. 
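Exporting can also be done with code, which is handy when you want a script to save its own figures. A minimal sketch, assuming a PDF file is acceptable and using a hypothetical file name in R's temporary directory:

```r
# Save the women plot to a PDF file in a temporary location
plot_file <- file.path(tempdir(), "women_plot.pdf")  # hypothetical file name

pdf(file = plot_file, width = 6, height = 4)  # open a PDF graphics device (size in inches)
plot(women)                                   # draw the plot into the file
dev.off()                                     # close the device to finish writing the file

file.exists(plot_file)                        # TRUE if the export worked
```

The pattern is always the same: open a graphics device, draw, then close the device with dev.off().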


Packages 


Next, let’s look at the Packages tab. 


[Screenshot: the Packages tab, listing installed packages such as askpass, assertthat and
backports with their descriptions and versions]


Packages are collections of R code that extend the functionality of R. We will discuss 
packages in detail in a future lesson. 


For now, it is important to know that to use a package, you need to install then load it.
Packages need to be installed only once, but must be loaded in each new R session.


All the package names you see (in blue font) are packages that are installed on your
system. Packages with a checkmark are packages which are loaded in the current
session.


You can install a package with the Install button of the Packages tab. 




But it is better to install and load packages with R code, rather than the Install button. 
Let’s try this. Type and run the code below to install the {highcharter} package. 


install.packages("highcharter")
library(highcharter)


The first line installs the package. The second line loads the package from your package
library.

Because you only need to install a package once, you can now remove the installation line 
from your script. 
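One common alternative to deleting the installation line is to install only when the package is missing. The sketch below shows that pattern; the helper name is_installed is our own, and the {highcharter} lines are commented out because they need internet access.

```r
# Helper (our own, not from the lesson): is a package already installed?
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # TRUE: the {stats} package ships with R

# Install {highcharter} only if it is missing, then load it
# if (!is_installed("highcharter")) install.packages("highcharter")
# library(highcharter)

# search() lists the packages that are loaded and attached in this session
search()
```

This keeps the script safe to re-run: the installation step is skipped whenever the package is already there.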

Now that the {highcharter} package has been installed and loaded, you can use the
functions that come in the package. To try this, type and run the code below:


highcharter::hchart(women$weight)


This code uses the hchart() function from the {highcharter} package to plot an
interactive histogram showing the distribution of weights in the women dataset.


(Of course, you may not yet know what a function is. We'll get to this soon.) 


Viewer 


Notice that the histogram above shows up in a Viewer tab. This tab allows you to preview 
HTML files and interactive objects. 


Help 


Lastly, the Help tab shows the documentation for different R objects. Try typing out and 
running each line below to see what this documentation looks like. 


?hchart
?women
?read.csv


[Screenshot: the Help tab showing the documentation page for hchart() from the {highcharter}
package, titled “Create a highchart object from a particular data type”]


Help files are not always very easy to understand for beginners, but with time they will 
become more useful. 
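If a full help page feels overwhelming, a lighter option is to look only at a function's arguments. A short sketch using base R's args() and formals(), applied to read.csv() (which you ran earlier):

```r
# These two lines open the same help page (commented out: they open the Help tab)
# ?read.csv
# help("read.csv")

# args() prints a function's arguments together with their default values
args(read.csv)

# formals() returns the same information as a list that you can inspect
names(formals(read.csv))    # the argument names, starting with "file" and "header"
formals(read.csv)$header    # TRUE: by default the first row is read as column names
```

You do not need to understand every argument; the point is that default values are documented and can be looked up.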




RStudio options 


RStudio has a number of useful options for changing its look and functionality. Let’s try
these. You may not understand all the changes made for now. That's fine.


In the RStudio menu at the top of the screen, select Tools > Global Options to bring 
up RStudio’s options. 


• Now, under Appearance, choose your ideal theme. (We like the “Crimson Editor”
and “Tomorrow Night” themes.)


[Screenshot: the Appearance section of the Options dialog, with the list of editor themes
and a code preview]


• Under Code > Display, check “Highlight R function calls”. What this does is give
your R functions a unique color, improving readability. You will understand this later.


• Also under Code > Display, check “Rainbow parentheses”. What this does is make
your “nested parentheses” easier to read by giving each pair a unique color.


[Screenshot: the Code > Display section of the Options dialog, with “Highlight R function
calls” and “Rainbow parentheses” checked, and examples of their effect on code]


• Finally, under General > Basic, uncheck the box that says “Restore .RData into
workspace at startup”. You don't want to restore any data to your workspace (or
environment) when you start RStudio. Starting with a clean workspace each time is
less likely to lead to errors.

This also means that you never want to “save your workspace to .RData on exit”,
so set this to Never.


Command palette 


The RStudio command palette gives instant, searchable access to many of the RStudio
menu options and settings that we have seen so far.


The palette can be invoked with the keyboard shortcut Ctrl + Shift + P (Cmd + Shift + P
on macOS).


It’s also available on the Tools menu (Tools -> Show Command Palette).


[Screenshot: the command palette, listing commands such as “Create a New R Script”,
“Create a new R Markdown document” and “Open File...”]


Try using it to: 




• Create a new script (Search “new script” and click on the relevant option)


• Rename a script (Search “rename” and click on the relevant option)


Wrapping up 


Congratulations! You are now a new citizen of RStudio. 


Of course, you have only scratched the surface of RStudio functionality. As you advance in
your R journey, you will discover new features, and you will hopefully grow to love the
wonderful integrated development environment (IDE) that is RStudio. One good place to
start is the official RStudio IDE cheatsheet.


Below is one section of that sheet: 


[Figure: the “R Support” section of the RStudio IDE cheatsheet, annotating the features of
the editor, console, Environment and Files panes]

See you in the next lesson! 




Further resources 


1. 23 RStudio Tips, Tricks, and Shortcuts 


Contributors 


The following team members contributed to this lesson: 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


LAMECK AGASA 


Statistician/Data Scientist 


References 


Some material in this lesson was adapted from the following sources: 


• “RStudio Cheatsheets.” RStudio, https://www.rstudio.com/resources/cheatsheets/.

• “Chapter 1 Getting Started: Data Skills for Reproducible Research.” Chapter 1 Getting
Started | Data Skills for Reproducible Research,
https://psyteachr.github.io/reprores-v2/intro.html.


This work is licensed under the Creative Commons Attribution Share Alike license. 


Lesson notes | Coding basics 


Created by the GRAPH Courses team 


January 2023 




Introduction
Comments
R as a calculator
Formatting code
Objects in R
Create an object
What is an object?
Datasets are objects too
Rename an object
Overwrite an object
Working with objects
Some errors with objects
Naming objects
Functions
Basic function syntax
Nesting functions
Packages
A first example: the {tableone} package
Full signifiers
pacman::p_load()
Wrap up


Learning objectives 


1. You can write comments in R. 

2. You can create section headers in RStudio. 

3. You know how to use R as a calculator. 

4. You can create, overwrite and manipulate R objects. 
5. You understand the basic rules for naming R objects. 
6. You understand the syntax for calling R functions. 

7. You know how to nest multiple functions. 


8. You can install and load add-on R packages and call functions from these
packages.


Introduction 


In the last lesson, you learned how to use RStudio, the wonderful integrated development 
environment (IDE) that makes working with R much easier. In this lesson, you will learn 
the basics of using R itself. 


To get started, open RStudio, and open a new script with File > New File > R Script
on the RStudio menu.




Next, save the script with File > Save on the RStudio menu or by using the shortcut
Command/Control + S. This should bring up the Save File dialog box. Save the file with a
name like “coding_basics”.


You should now type all the code from this lesson into that script. 


Comments 


There are two main types of text in an R script: commands and comments. A command is
a line or lines of R code that instructs R to do something (e.g. 2 + 2).


A comment is text that is ignored by the computer. 


Anything that follows a # symbol (pronounced “hash” or “pound”) on a given line is a 
comment. Try typing out and running the code below to see this: 
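A small illustration of this (the specific lines are our own example, not necessarily the ones used in the lesson video):

```r
# This whole line is a comment, so R ignores it
2 + 2 # R runs `2 + 2` and ignores everything after the hash
# 2 + 3 (this line does not run either, because it starts with a hash)
```

When you run these three lines, only one result is printed, the 4 from the middle line.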


Since they are ignored by the computer, comments are meant for humans. They help you
and others keep track of what your code is doing. Use them often! Like your mother
always says, “too much everything is bad, except for R comments”.


Question 1 


True or False: both code chunks below are valid ways to comment code?


# add two numbers
2 + 2


2 + 2 # add two numbers


Note: All question answers can be found at the end of the lesson. 


A fantastic use of comments is to separate your scripts into sections. If you put four 
dashes after a comment, RStudio will create a new section in your code: 


# New section ---- 


This has two nice benefits. First, you can click on the little arrow beside the section
header to fold, or collapse, that section of code:


[Screenshot: a section header in the editor with its fold arrow]


Second, you can click on the “Outline” icon at the top right of the Editor to view and 
navigate through all the contents in your script: 


[Screenshot: the Outline panel listing “New section” and “Another section”]


R as a calculator


R works as a calculator, and obeys the correct order of operations. Type and run the 
following expressions and observe their output: 


2 + 2 # two plus two
## [1] 4

2 - 2 # two minus two
## [1] 0

2 * 2 # two times two
## [1] 4

2 / 2 # two divided by two
## [1] 1

2 ^ 2 # two raised to the power of two
## [1] 4

2 + 2 * 2 # this is evaluated following the order of operations
## [1] 6

sqrt(100) # square root
## [1] 10


The square root command shown on the last line is a good example of an R function, 
where 100 is the argument to the function. You will see more functions soon. 
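To see the order of operations at work, it can help to compare expressions with and without parentheses. The numbers below are our own illustrative choices:

```r
2 + 2 * 2     # multiplication happens first: 2 + 4
## [1] 6
(2 + 2) * 2   # parentheses force the addition to happen first: 4 * 2
## [1] 8
sqrt(16) + 9  # the function call is evaluated first: 4 + 9
## [1] 13
sqrt(16 + 9)  # here the argument `16 + 9` is evaluated first: sqrt(25)
## [1] 5
```

When in doubt about the order, adding parentheses makes your intention explicit.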


REMINDER

We hope you remember the shortcut to run code!

To run a single line of code, place your cursor anywhere on that line,
then hit Command + Enter on macOS, or Control + Enter on Windows.

To run multiple lines, drag your cursor to highlight the relevant lines,
then again press Command/Control + Enter.


Question 2 


In the following expression, which sign is evaluated first by R, the minus or the division? 


2 - 2 / 2


Formatting code 


R does not care how you choose to space out your code. 


For the math operations we did above, all the following would be valid code: 


2+2
## [1] 4
2 + 2
## [1] 4
2       +       2
## [1] 4


Similarly, for the sqrt() function used above, any of these would be valid:


sqrt(100)
## [1] 10

sqrt (  100  )
## [1] 10


# you can even space the command out over multiple lines
sqrt(
  100
)
## [1] 10

But of course, you should try to space out your code in sensible ways. What exactly is
“sensible”? Well, it may be hard for you to know at the moment. Over time, as you read
other people's code, you will learn that there are certain R conventions for code spacing
and formatting.


In the meantime, you can ask RStudio to help format your code for you. To do this,
highlight any section of code you want to reformat, and, on the RStudio menu, go to Code
> Reformat Code, or use the shortcut Shift + Command/Control + A.


WATCH OUT

Stuck on the + sign

If you run an incomplete line of code, R will print a + sign to indicate that
it is waiting for you to finish the code.

For example, if you run the following code:

sqrt(100

you will not get the output you expect (10). Rather, the console will print sqrt(
and a + sign:

> sqrt(100
+

R is waiting for you to complete the closing parenthesis. You can complete
the code and get rid of the + by just entering the missing parenthesis:

> sqrt(100
+ )
[1] 10

Alternatively, press the escape key, ESC, while your cursor is in the console
to start over.


Objects in R


Create an object 


When you run code as we have been doing above, the result of the command (or its 
value) is simply displayed in the console—it is not stored anywhere. 


2 + 2 # R prints this result, 4, but does not store it


## [1] 4 


To store a value for future use, assign it to an object with the assignment operator, <- : 


my_obj <- 2 + 2 # assign the result of `2 + 2` to the object called `my_obj`
my_obj # print my_obj


## [1] 4


The assignment operator, <- , is made of the ‘less than’ sign, < , and a minus, -. You will
use it thousands of times over your R lifetime, so please don't type it manually! Instead,
use RStudio’s shortcut, alt + - (alt AND minus) on Windows or option + - (option AND
minus) on macOS.


Also note that you can use the equals sign, =, for assignment.


my_obj = 2 + 2


But this is not commonly used by the R community (mostly for historical
reasons), so we discourage it too. Follow the convention and use <-.


Now that you've created the object my_obj, R knows all about it and will keep track of it 
during this R session. You can view any created objects in the Environment tab of RStudio. 


[Screenshot: the Environment tab showing my_obj with the value 4]


What is an object? 


So what exactly is an object? Think of it as a named bucket that can contain anything. 
When you run the code below: 


my_obj <- 20


you are telling R, “put the number 20 inside a bucket named ‘my_obj’”.


[Figure: the number 20 being put inside an object called `my_obj`]


Once the code is run, we would say, in R terms, that “the value of the object called my_obj is
20”.


And if you run this code: 


first_name <- "Joanna"


you are instructing R to “put the value ‘Joanna’ inside the bucket called ‘first_name’”.


[Figure: the value “Joanna” being put inside an object called `first_name`]


Once the code is run, we would say, in R terms, that “the value of the first_name object
is ‘Joanna’”.


Note that R evaluates the code before putting it inside the bucket. 


So, before when we ran this code, 


my_obj <- 2 + 2


R first does the calculation of 2 + 2, then stores the result, 4, inside the object.


[Figure: R evaluates `2 + 2`, then stores the result inside an object called `my_obj`]
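A quick way to convince yourself that only the result is stored, not the calculation itself, is to inspect the object afterwards. In this sketch the object name my_calc and the numbers are our own illustrative choices:

```r
my_calc <- 10 / 4 + 1   # R first evaluates the right-hand side...
my_calc                 # ...then stores only the result
## [1] 3.5
class(my_calc)          # the object holds a number, not the text "10 / 4 + 1"
## [1] "numeric"
```

The class() function reports what kind of value an object holds.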


Question 3 


Consider the code chunk below: 


result <- 2 + 2 + 2


What is the value of the result object created?

A. 2 + 2 + 2

B. numeric

C. 6


Datasets are objects too 


So far, you have been working with very simple objects. You may be thinking “Where are
the spreadsheets and datasets? Why are we writing my_obj <- 2 + 2? Is this a primary
school maths class?!”


Be patient. 


We want you to get familiar with the concept of an R object because once you start 
dealing with real datasets, these will also be stored as R objects. 


Let's see a preview of this now. Type out the code below to download a dataset on Ebola
cases that we stored on Google Drive and put it in the object
ebola_sierra_leone_data.


ebola_sierra_leone_data <- read.csv("https://tinyurl.com/ebola-data-sample")
ebola_sierra_leone_data # print ebola data


##    id age sex    status date_of_onset date_of_sample district
## 1 167  55   M confirmed    2014-06-15     2014-06-21   Kenema
## 2 129  41   M confirmed    2014-06-13     2014-06-18 Kailahun
## 3 270  12   F confirmed    2014-06-28     2014-07-03 Kailahun
## 4 187  NA   F confirmed    2014-06-19     2014-06-24 Kailahun
## 5  85  20   M confirmed    2014-06-08     2014-06-24 Kailahun


This data contains a sample of patient information from the 2014-2016 Ebola outbreak in 
Sierra Leone. 


Because you can store datasets as objects, it's very easy to work with multiple datasets at
the same time.


Below, we import and view another dataset from the web: 


diabetes_china <- read.csv("https://tinyurl.com/diabetes-china")


Because the dataset above is quite large, it may be helpful to look at it in the data viewer:


View(diabetes_china)


Notice that both datasets now appear in your Environment tab. 
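Besides the data viewer, a few base R functions give a quick description of a dataset object. The sketch below uses women, a small dataset built into R, so that it runs without internet access; the same functions would work on ebola_data or diabetes_china.

```r
class(women)    # "data.frame": the usual kind of object for a dataset
dim(women)      # 15 rows and 2 columns
names(women)    # the column names: "height" and "weight"
```

These are handy when a dataset is too large to inspect comfortably by eye.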


SIDE NOTE

Rather than reading data from an internet drive as we did above, it is
more likely that you will have the data on your computer, and you will
want to read it into R from there. We will cover this in a future
lesson.

Later in the course, we will also show you how to store and read data
from a web service like Google Drive, which is nice for easy portability.


Rename an object 
You sometimes want to rename an object. It is not possible to do this directly. 


To rename an object, you make a copy of the object with a new name, and delete the 
original. 


For example, maybe we decide that the name of the ebola_sierra_leone_data object
is too long. To change it to the shorter ebola_data, run:


ebola_data <- ebola_sierra_leone_data


This has copied the contents from the ebola_sierra_leone_data bucket to a new
ebola_data bucket.


You can now get rid of the old ebola_sierra_leone_data bucket with the rm()
function, which stands for “remove”:


rm(ebola_sierra_leone_data)


Overwrite an object 
Overwriting an object is like changing the contents of a bucket. 


For example, previously we ran this code to store the value “Joanna” inside the
first_name object:


first_name <- "Joanna"


To change this to a different value, simply re-run the line with a different value:


first_name <- "Luigi"


You can take a look at the Environment tab to observe the change.
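Overwriting also works when the new value is computed from the old one, because R evaluates the right-hand side before storing the result. A minimal sketch with a hypothetical object name, visit_count:

```r
visit_count <- 1                 # hypothetical object, for illustration only
visit_count <- visit_count + 1   # R evaluates `1 + 1`, then overwrites the old value
visit_count
## [1] 2
```

The old value (1) is gone after the second line; only the most recent assignment is kept.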


Working with objects 


Most of your time in R will be spent manipulating R objects. Let’s see some quick 
examples. 


You can run simple commands on objects. For example, below we store the value 100 in 
an object and then take the square root of the object: 


my_number <- 100
sqrt(my_number)


R “sees” my_number as the number 100, and so is able to evaluate its square root.


You can also combine existing objects to create new objects. For example, type out the
code below to add my_number to itself, and store the result in a new object called my_sum:


my_sum <- my_number + my_number


What should be the value of my_sum? First take a guess, then check it. 


SIDE NOTE

To check the value of an object, such as my_sum, you can type and run
just the code my_sum in the Console or the Editor. Alternatively, you can
simply highlight the value my_sum in the existing code and press
Command/Control + Enter.


But of course, most of your analysis will involve working with data objects, such as the 
ebola_data object we created previously. 


Let's see a very simple example of how to interact with a data object; we will tackle it 
properly in the next lesson. 


To get a table of the sex distribution of patients in the ebola_data object, we
can run the following:


table(ebola_data$sex)


## 
## F M 
## 124 76 


The dollar sign symbol, $, above allowed us to subset to a specific column.
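The same dataset$column pattern works on any dataset. The sketch below uses two datasets that are built into R (women and mtcars) so that it runs without downloading anything:

```r
women$height            # the `height` column of the women dataset
mean(women$height)      # the average of that column
## [1] 65
table(mtcars$cyl)       # counts of cars by number of cylinders
##
##  4  6  8
## 11  7 14
```

In each case the name before the $ is the dataset and the name after it is the column.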
Question 4 


a. Consider the code below. What is the value of the answer object? 


eight <- 9 
answer <- eight - 8 


b. Use table() to make a table with the distribution of patients across districts in the
ebola_data object.


Some errors with objects 


first_name <- "Luigi"
last_name <- "Fenway"


full_name <- first_name + last_name


Error in first_name + last_name : non-numeric argument to binary operator


The error message tells you that these objects are not numbers and therefore cannot be 
added with +. This is a fairly common error type, caused by trying to do inappropriate 
things to your objects. Be careful about this. 


In this particular case, we can use the function paste() to put these two objects
together:


full_name <- paste(first_name, last_name)
full_name


## [1] "Luigi Fenway" 
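By default paste() puts a space between the pieces. Two small variations, shown here as a sketch, are the sep argument and the paste0() function, both part of base R:

```r
first_name <- "Luigi"
last_name <- "Fenway"

paste(first_name, last_name)              # the default separator is a space
## [1] "Luigi Fenway"
paste(first_name, last_name, sep = "-")   # choose a different separator
## [1] "Luigi-Fenway"
paste0(first_name, last_name)             # paste0() uses no separator at all
## [1] "LuigiFenway"
```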


Another error you'll get a lot is Error: object 'XXX' not found. For example:


my_number <- 48 # define `my_number`
My_number + 2 # attempt to add 2 to `my_number`


Error: object 'My_number' not found


Here, R returns an error message because we haven't created (or defined) the object
My_number yet. (Recall that R is case-sensitive.)
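If you are unsure whether a name has been defined, the base R function exists() can check it without triggering an error. A minimal sketch, assuming a fresh session in which nothing called My_number has been created:

```r
my_number <- 48
exists("my_number")   # TRUE: this exact name has been defined
exists("My_number")   # FALSE: the capital M makes it a different name
```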


When you first start learning R, dealing with errors can be frustrating. They're often
difficult to understand (e.g. what exactly does “non-numeric argument to binary
operator” mean?).


Try Googling any error messages you get and browsing through the first few results. This
will lead you to forums (e.g. stackoverflow.com) where other R learners have complained
about the same error. Here you may find explanations of, and solutions to, your problems.


Question 5 


a. The code below returns an error. Why? 


my_first_name <- "Kene"
my_last_name <- "Nwosu"
my_first_name + my_last_name


b. The code below returns an error. Why? (Look carefully)


my_first_name <- "Kene"
my_last_name <- "Nwosu"


paste(my_1st_name, my_last_name)


Naming objects 


There are only two hard things in Computer Science: cache invalidation and 
naming things. 


— Phil Karlton. 


Because much of your work in R involves interacting with objects you have created, 
picking intelligent names for these objects is important. 


Naming objects is difficult because names should be both short (so that you can type 
them quickly) and informative (so that you can easily remember what is inside the 
object), and these two goals are often in conflict. 


So names that are too long, like the one below, are bad because they take forever to type. 


sample_of_the_ebola_outbreak_dataset_from_sierra_leone_in_2014


And a name like data is bad because it is not informative; the name does not give a good 
idea of what the object is. 


As you write more R code, you will learn how to write short and informative names. 


For names with multiple words, there are a few conventions for how to separate the 
words: 


snake_case <- "Snake case uses underscores"
period.case <- "Period case uses periods" 
camelCase <- "Camel case capitalizes new words (but not the first word)" 


We recommend snake_case, which uses all lower-case words, and separates words with _. 


Note too that there are some limitations on objects’ names: 


• Names must start with a letter. So 2014_data is not a valid name (because it starts
with a number).


• Names can only contain letters, numbers, periods (.) and underscores (_). So
ebola-data or ebola~data or ebola data with a space are not valid names.


If you really want to use these characters in your object names, you can enclose the 
names in backticks: 


`ebola-data` 
`ebola~data` 
`ebola data` 


All of the above are valid R object names. For example, type and run the following code: 


`ebola~data` <- ebola_data
`ebola~data`


But in general you should avoid using backticks to rescue bad object names. Just write 
proper names. 
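To check whether a candidate name follows these rules, one option is base R's make.names(), which converts any text into a valid name. The helper is_valid_name below is our own sketch built on top of it, not a function from the lesson:

```r
# Helper (our own): a name is valid if make.names() leaves it unchanged
is_valid_name <- function(name) {
  name == make.names(name)
}

is_valid_name("ebola_data")   # TRUE
is_valid_name("2014_data")    # FALSE: starts with a number
is_valid_name("ebola-data")   # FALSE: contains a hyphen
is_valid_name("ebola data")   # FALSE: contains a space

make.names("2014_data")       # suggests the valid name "X2014_data"
```

The suggestion from make.names() is valid but rarely pretty, so it is usually better to pick a clean snake_case name yourself.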


Question 6 


In the code chunk below, we are attempting to take the top 20 rows of the ebola_data
table. All but one of these lines have an error. Which line will run properly?


20_top_rows <- head(ebola_data, 20)
twenty-top-rows <- head(ebola_data, 20)
top_20_rows <- head(ebola_data, 20)


Functions 


Much of your work in R will involve calling functions. 


You can think of each function as a machine that takes in some input (or arguments) and 
returns some output. 


[Figure: a function as a machine that takes in inputs (arguments) and returns outputs]


So far you have already seen many functions, including sqrt(), paste() and plot().
Run the lines below to refresh your memory:


sqrt(100)
paste("I am number", 2 + 2)
plot(women)


Basic function syntax 


The standard way to call a function is to provide a value for each argument: 


function_name(argument1 = "value", argument2 = "value")


Let’s demonstrate this with the head() function, which returns the first few elements of 
an object. 


To return the first three rows of the Ebola dataset, you run: 


head(x = ebola_data, n = 3)


##    id age sex    status date_of_onset date_of_sample district
## 1 167  55   M confirmed    2014-06-15     2014-06-21   Kenema
## 2 129  41   M confirmed    2014-06-13     2014-06-18 Kailahun
## 3 270  12   F confirmed    2014-06-28     2014-07-03 Kailahun


In the code above, head() takes in two arguments:
• x, the object of interest, and
• n, the number of elements to return.


We can also swap the order of the arguments: 


head(n = 3, x = ebola_data)


##    id age sex    status date_of_onset date_of_sample district
## 1 167  55   M confirmed    2014-06-15     2014-06-21   Kenema
## 2 129  41   M confirmed    2014-06-13     2014-06-18 Kailahun
## 3 270  12   F confirmed    2014-06-28     2014-07-03 Kailahun


If you put the argument values in the right order, you can skip typing their names. So the 
following two lines of code are equivalent and both run: 


head(x = ebola_data, n = 3)


##    id age sex    status date_of_onset date_of_sample district
## 1 167  55   M confirmed    2014-06-15     2014-06-21   Kenema
## 2 129  41   M confirmed    2014-06-13     2014-06-18 Kailahun
## 3 270  12   F confirmed    2014-06-28     2014-07-03 Kailahun


head(ebola_data, 3)


##    id age sex    status date_of_onset date_of_sample district
## 1 167  55   M confirmed    2014-06-15     2014-06-21   Kenema
## 2 129  41   M confirmed    2014-06-13     2014-06-18 Kailahun
## 3 270  12   F confirmed    2014-06-28     2014-07-03 Kailahun


But if the argument values are in the wrong order, you will get an error if you do not type 
the argument names. Below, the first line runs but the second does not run: 


head(n = 3, x = ebola_data)
head(3, ebola_data)


(To see the “correct order” for the arguments, take a look at the help file for the head()
function.)
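
The matching rules described here can be tried out on the built-in women dataset. What follows is only a minimal sketch: the object names by_name, by_position and wrong_order are made up for illustration, and tryCatch() (a base R function not yet covered in this course) is used only so that the failing call does not halt the script.

```r
# Named arguments can be given in any order
by_name <- head(n = 3, x = women)

# Unnamed arguments are matched by position: x first, then n
by_position <- head(women, 3)

# Both calls return the same three rows
identical(by_name, by_position)

# Unnamed arguments in the wrong order: R treats 6 as x and women as n.
# tryCatch() captures the resulting error instead of stopping the script.
wrong_order <- tryCatch(head(6, women), error = function(e) e)
inherits(wrong_order, "error")
```

If you name the arguments, the order stops mattering; if you leave the names out, R relies on position alone.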


Some function arguments can be skipped altogether, because they have default values. 


For example, with head(), the default value of n is 6, so running just head(ebola_data)
will return the first 6 rows.


head(ebola_data)


##    id age sex    status date_of_onset date_of_sample district
## 1 167  55   M confirmed    2014-06-15     2014-06-21   Kenema
## 2 129  41   M confirmed    2014-06-13     2014-06-18 Kailahun
## 3 270  12   F confirmed    2014-06-28     2014-07-03 Kailahun
## 4 187  NA   F confirmed    2014-06-19     2014-06-24 Kailahun
## 5  85  20   M confirmed    2014-06-08     2014-06-24 Kailahun
## 6 277  30   F confirmed    2014-06-29     2014-07-01   Kenema
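
One quick way to convince yourself that a default value is in play is to compare a call that omits the argument with a call that sets it explicitly. The sketch below does this with the built-in women dataset; default_rows and explicit_rows are illustrative names only.

```r
# head() with n left at its default value
default_rows <- head(women)

# head() with n set explicitly to the documented default of 6
explicit_rows <- head(women, n = 6)

# The two results should be the same object
identical(default_rows, explicit_rows)
nrow(default_rows)
```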


To see the arguments to a function, press the Tab key when your cursor is inside the 
function's parentheses: 


[Screenshot: RStudio's autocomplete pop-up for head(), listing the arguments x and ...]


Question 7


In the code lines below, we are attempting to take the top 6 rows of the women dataset 
(which is built into R). Which line is invalid? 


head(women)
head(women, 6)
head(x = women, 6)
head(x = women, n = 6)
head(6, women)


(If you are not sure, just try typing and running each line. Remember that the goal here is 
for you to gain some practice.) 


Let’s spend some time playing with another function, the paste() function, which we
already saw above. This function is a bit special because it can take in any number of input
arguments.


So you could have two arguments: 


paste("Luigi", "Fenway")

## [1] "Luigi Fenway" 

Or four arguments: 

paste("Luigi", "Fenway", "Luigi", "Fenway") 
## [1] "Luigi Fenway Luigi Fenway" 


And so on up to infinity. 
And as you might recall, we can also paste() named objects:


first_name <- "Luigi"
last_name <- "Fenway"
paste("My name is", first_name, "and my last name is", last_name)


## [1] "My name is Luigi and my last name is Fenway" 


PRO TIP
Functions like paste() can take in many values because they have a
special argument, an ellipsis: ... If you consult the help file for the paste
function, you will see this:


Arguments
...    one or more R objects, to be converted to character vectors.


Another useful argument for paste() is called sep. It tells R what character to use to
separate the terms:


paste("Luigi", "Fenway", sep = "-")


## [1] "Luigi-Fenway" 
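
To see how sep changes the result, it can help to run a few variants side by side. This is a small sketch with made-up object names; paste0() is a base R function not introduced in this lesson, included here only as a common shortcut.

```r
# The default separator is a single space
with_space <- paste("Luigi", "Fenway")

# An empty string as separator joins the terms directly
no_space <- paste("Luigi", "Fenway", sep = "")

# paste0() is a base R shortcut for paste(..., sep = "")
also_no_space <- paste0("Luigi", "Fenway")

with_space
no_space
identical(no_space, also_no_space)
```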




Nesting functions 


The output of a function can be immediately taken in by another function. This is called 
function nesting. 


For example, the function tolower() converts a string to lower case.


tolower("LUIGI")


## [1] "luigi" 


You can take the output of this and pass it directly into another function: 


paste(tolower("LUIGI"), "is my name")


## [1] "luigi is my name"


Without this option of nesting, you would have to assign an intermediate object:


my_lowercase_name <- tolower("LUIGI")
paste(my_lowercase_name, "is my name")


## [1] "luigi is my name"
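
Nesting is not limited to two functions. In the sketch below, three calls are nested and R evaluates them from the inside out; the same steps are then repeated with intermediate objects for comparison. toupper() and nchar() are base R functions not covered in this lesson, and all object names are illustrative.

```r
# Three nested calls: paste() runs first, then toupper(), then nchar()
name_length <- nchar(toupper(paste("luigi", "fenway")))

# The same steps written with intermediate objects
full_name <- paste("luigi", "fenway")
upper_name <- toupper(full_name)
name_length_stepwise <- nchar(upper_name)

# Both approaches give the same number of characters
identical(name_length, name_length_stepwise)
```

The nested version is shorter, while the stepwise version can be easier to debug because each intermediate result can be inspected.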


Function nesting will come in very handy soon. 
Question 8 


The code chunks below are all examples of function nesting. One of the lines has an error. 
Which line is it, and what is the error? 


sqrt(head(women))
paste(sqrt(9), "plus 1 is", sqrt(16))
sqrt(tolower("LUIGI"))


———— 


Packages 


As we mentioned previously, R is wonderful because it is user extensible: anyone can
create a software package that adds new functionality. Most of R’s power comes from
these packages.


In the previous lesson, you installed and loaded the {highcharter} package using the
install.packages() and library() functions. Let's learn a bit more about packages
now.


A first example: the {tableone} package 


Let's now install and use another R package, called tableone:


install.packages("tableone")
library(tableone)


Note that you only need to install a package once, but you have to load it with library()
each time you want to use it. This means that you should generally run the
install.packages() line directly from the console, rather than typing it into your
script.


The package eases the construction of “Table 1”, i.e. a table with characteristics of the
study sample that is commonly found in biomedical research papers.


The simplest use case is summarizing the whole dataset. You can just feed in the data
frame to the data argument of the main workhorse function CreateTableOne().


CreateTableOne(data = ebola_data)


                             Overall
  n                          200
  id (mean (SD))             146.00
  age (mean (SD))            33.12
  sex = M (%)                 76 (38.0)
  status = suspected (%)      18 ( 9.0)
  date_of_onset (%)
     ...
  date_of_sample (%)
     ...
     2014-06-29                8 ( 4.0)
     2014-06-30                6 ( 3.0)
     2014-07-01                4 ( 2.0)
     2014-07-02               16 ( 8.0)
     2014-07-03               13 ( 6.5)
     2014-07-04                2 ( 1.0)
     2014-07-05                2 ( 1.0)
     2014-07-06                1 ( 0.5)
     2014-07-08                3 ( 1.5)
     2014-07-12                1 ( 0.5)
     2014-07-14                1 ( 0.5)
     2014-07-17                1 ( 0.5)
     2014-07-21                1 ( 0.5)
  district (%)
     Bo                        4 ( 2.0)
     Kailahun                146 (73.0)
     Kenema                   41 (20.5)
     Kono                      2 ( 1.0)
     Port Loko                 2 ( 1.0)
     Western Urban             5 ( 2.5)

You can see there are 200 patients in this dataset, the mean age is 33 and 38% of the
sample is male, among other details.


Very cool! (One problem is that the package is assuming that the date variables are 
categorical; because of this the output table is much too long!) 


The point of this demonstration of {tableone} is to show you that there is a lot of power 
in external R packages. This is a big strength of working with R, an open-source language 
with a vibrant ecosystem of contributors. Thousands of people are working right now on 
packages that may be helpful to you one day. 


You can Google search “Cool R packages” and browse through the answers if you are 
eager to learn about more R packages. 


SIDE NOTE
You may have noticed that we enclose package names in curly braces,
e.g. {tableone}. This is just a styling convention among R users/teachers.
The braces do not mean anything.


Full signifiers 


The full signifier of a function includes both the package name and the function name: 
package::function(). 


So for example, instead of writing: 


CreateTableOne(data = ebola_data)


We could write this function with its full signifier, package::function():


tableone::CreateTableOne(data = ebola_data)


You usually do not need to use these full signifiers in your scripts. But there are some 
situations where it is helpful: 


The most common reason is that you want to make it very clear which package a function 
comes from. 


Secondly, you sometimes want to avoid needing to run library(package) before
accessing the functions in a package. That is, you want to use a function from a package
without first loading that package from the library. In that case, you can use the full
signifier syntax.


So the following:


tableone::CreateTableOne(data = ebola_data)


is equivalent to:


library(tableone)
CreateTableOne(data = ebola_data)
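
The same package::function() syntax works for the packages that ship with R, which makes it easy to try without installing anything. The sketch below uses median() from {stats} and head() from {utils}; the object names short_form and full_form are made up for illustration.

```r
# median() lives in the {stats} package, which R loads at startup
short_form <- median(c(6, 46, 25))
full_form <- stats::median(c(6, 46, 25))

# Both spellings call the same function and give the same result
identical(short_form, full_form)

# head() lives in the {utils} package
utils::head(women, n = 2)
```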


Question 9 


Consider the code below: 


tableone::CreateTableOne(data = ebola_data)


Which of the following is a correct interpretation of what this code means?


A. The code applies the CreateTableOne function from the {tableone}
package on the ebola_data object.


B. The code applies the CreateTableOne argument from the {tableone}
function on the ebola_data package.


C. The code applies the CreateTableOne function from the {tableone}
package on the ebola_data package.


pacman::p_load() 


Rather than use two separate functions, install.packages() then library(), to install
and then load packages, you can use a single function, p_load(), from the {pacman} package.
It automatically installs a package if it is not yet installed, and then loads it. We
encourage this approach in the rest of this course.


Install {pacman} now by running this in your console:
install.packages("pacman")


From now on, when you are introduced to a new package, you can simply use
pacman::p_load(package_name) to both install and load the package.


Try this now for the outbreaks package, which we will use soon:


pacman::p_load(outbreaks)


Now we have a small problem. The wonderful function pacman::p_load() automatically
installs and loads packages.


But it would be nice to have some code that automatically installs the {pacman} package 
itself, if it is missing on a user’s computer. 


But if you put the install.packages() line in a script, like so:


install.packages("pacman")
pacman::p_load(here, rmarkdown)


you will waste a lot of time, because every time a user opens and runs the script, it will
reinstall {pacman}, which can take a while. Instead we need code that first checks
whether {pacman} is already installed, and installs it only if it is not.


We can do this with the following code: 


if(!require(pacman)) install.packages("pacman")


You do not have to understand it at the moment, as it uses some syntax that you have not 
yet learned. Just note that in future chapters, we will often start a script with code like 
this: 


if(!require(pacman)) install.packages("pacman")
pacman::p_load(here, rmarkdown)


The first line will install {pacman} if it is not yet installed. The second line will use
the p_load() function from {pacman} to load the remaining packages (and
pacman::p_load() installs any packages that are not yet installed).
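
To make the check-then-install-then-load idea more concrete, here is a rough base R sketch of the same pattern. It is not how {pacman} is actually implemented, and it uses a for loop and requireNamespace(), neither of which has been covered yet. The package list needed_packages is hypothetical; {stats} and {utils} were chosen because they ship with R, so nothing is downloaded when the sketch runs.

```r
# Hypothetical list of packages; {stats} and {utils} ship with R,
# so the install step is never triggered in this sketch.
needed_packages <- c("stats", "utils")

for (pkg in needed_packages) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() take the package name from a string
  library(pkg, character.only = TRUE)
}
```

In practice the two-line pacman pattern does the same job with far less typing, which is why the course uses it.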


Phew! Hope your head is still intact. 
Question 10 


At the start of an R script, we would like to install and load the package called {janitor}. 
Which of the following code chunks do we recommend you have in your script? 


A.


if(!require(pacman)) install.packages("pacman")
pacman::p_load(janitor)


B.


install.packages("janitor")
library(janitor)


C.


install.packages("janitor")
pacman::p_load(janitor)


Wrapping up
With your new knowledge of R objects, R functions and the packages that functions come
from, you are ready, believe it or not, to do basic data analysis in R. We'll jump into this
head first in the next lesson. See you there!


Answers 
1. True. 
2. The division sign is evaluated first. 


3. The answer is C. The code 2 + 2 + 2 gets evaluated before it is stored in the 
object. 


4. a. The value is 1. The code evaluates to 9-8. 
b. table(ebola_data$district) 
5. a. You cannot add two character strings. Adding only works for numbers. 


b. my_1st_name is typed with the number 1 initially, but in the paste() command,
it is typed with the letter “I”.


6. The third line is the only line with a valid object name: top_20_rows


7. The last line, head(6, women), is invalid because the arguments are in the wrong 
order and they are not named. 


8. The third code chunk has a problem. It attempts to find the square root of a 
character, which is impossible. 


9. The first line, A, is the correct interpretation. 


10. The first code chunk is the recommended way to install and load the package 
{janitor} 


Contributors 
The following team members contributed to this lesson: 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


LAMECK AGASA 


Statistician/Data Scientist 


OLIVIA KEISER 


Head of division of Infectious Diseases and Mathematical Modelling, 
University of Geneva 


References 
Some material in this lesson was adapted from the following sources: 
• “File:Apple slicing function.png.” Wikimedia Commons, the free media repository. 1
Oct 2021, 04:26 UTC. 20 Mar 2022, 17:27 <https://commons.wikimedia.org/w/index.php?title=File:Apple_slicing_function.png&oldid=594767630>.


• “PsyteachR | Data Skills for Reproducible Research.” 2021. Github.io. 2021. https://psyteachr.github.io/reprores-v2/index.html.


• Douglas, Alex, Deon Roos, Francesca Mancini, Ana Couto, and David Lusseau. 2022.
“An Introduction to R.” Intro2r.com. January 27, 2022. https://intro2r.com/.


This work is licensed under the Creative Commons Attribution Share Alike license. 


Lesson notes | Data dive: Ebola in Sierra Leone


Created by the GRAPH Courses team 


January 2023 




Introduction
Script setup
Header
Packages
Importing data into R
Intro to reproducibility
Quick data exploration
vis_dat()
inspect_cat() and inspect_num()
Analyzing a single numeric variable
Extract a column vector with $
Basic operations on a numeric variable
Visualizing a numeric variable
Analyzing a single categorical variable
Frequency tables
Visualizing a categorical variable
Answering questions about the outbreak
Haven't had enough?
Wrapping up


Learning objectives 


1. You can use RStudio’s graphical user interface to import CSV data into R.
2. You can explain the concept of reproducibility.
3. You can use the nrow(), ncol() and dim() functions to get the dimensions of a
dataset, and the summary() function to get a summary of the dataset’s variables.
4. You can use vis_dat(), inspect_num() and inspect_cat() to obtain visual
summaries of a dataset.
5. You can inspect a numeric variable:
   o with the summary functions mean(), median(), max(), min(), length() and
   sum();
   o with esquisse-generated ggplot2 code.
6. You can inspect a categorical variable:
   o with the summary functions table() and janitor::tabyl();
   o with the graphical functions barplot() and pie().


Introduction 


With your newly-acquired knowledge of functions and objects, you now have the basic 
building blocks required to do simple data analysis in R. So let’s get started. The goal is to 
start working with data as quickly as possible, even before you feel ready. 


Here you will analyze a dataset of confirmed and suspected cases of Ebola hemorrhagic 
fever in Sierra Leone in May and June of 2014 (Fang et al., 2016). The data is shown below: 


You will import and explore this dataset, then use R to answer the following questions 
about the outbreak: 


• When was the first case reported?
• What was the median age of those affected?
• Were there more cases in men or in women?
• Which district had the most reported cases?
• By the end of June 2014, was the outbreak growing or receding?


Script setup 


First, open a new script in RStudio with File > New File > R Script. (If you are on
RStudio Cloud, you can open up any of your previously-created projects.)


Next, save the script with File > Save As or press Command/Control + S to bring up
the Save File dialog box. Save the file with the name “ebola_analysis” or something similar.


Empty your environment at the start of the analysis 


SIDE NOTE 


When you start a new analysis, your R environment should usually be 
empty. Verify this by opening the Environment tab; it should say 
“Environment is empty”. If instead, it shows some previously-loaded 
objects, it is recommended to restart R by going to the menu option 
Session > Restart R 


Header 


Add a title, name and date to the start of the script, as code comments. This is generally
good practice for writing R scripts, as it helps give you and your collaborators context
about your script. Your header may look like this:


# Ebola Sierra Leone analysis 
# John Sample-Name Doe 
# 2024-01-01 


Packages 


Next, use the p_load() function from {pacman} to load the packages you will be using. 
Put this under a section header called “Load packages”, with four hyphens, as shown 
below: 


# Load packages ----
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
  tidyverse, # meta-package
  inspectdf,
  plotly,
  janitor,
  visdat,
  esquisse
)

REMINDER
Remember that the full signifier of a function includes both the package
name and the function name, package::function(). This full signifier
is handy if you want to use a function before you have loaded its source
package. This is the case in the code chunk above: we want to use p_load()
from {pacman} without formally loading the {pacman} package, so we
type pacman::p_load().


We could also first load {pacman} before using the p_load function:


library(pacman) # first load {pacman}
p_load(tidyverse) # use `p_load` from {pacman} to load other packages


(Also recall that the benefit of p_load() is that it automatically installs a
package if it is not yet installed. Without p_load(), you have to first
install the package with install.packages() before you can load it
with library().)


Importing data into R 


Now that the needed packages are loaded, you should import the dataset. 


About the Ebola dataset 


SIDE NOTE
The data you will be working on contains a sample of patient information
from the 2014-2016 Ebola outbreak in Sierra Leone. It comes from a
research paper which analyzed the transmission dynamics of that
outbreak. Key variables include the status of a case (whether the case
was confirmed or suspected), the date of onset, and the date a
sample was taken. To learn more about these data, visit the source
publication here: bit.ly/ebola-data-source. Or search the following DOI on
DOI.org: 10.1073/pnas.1518587113.


Go to bit.ly/view-ebola-data to view the dataset you will be working on. Then click the 
download icon at the top to download it to your computer. 


[Screenshot: online preview of the dataset, with the columns id, age, sex, status, date_of_onset, date_of_sample and district]


You can leave the dataset in your downloads folder, or move it to somewhere more 
respectable; the upcoming steps will work independent of where the data is stored. In the 
next lesson, you will learn how to organize your data analysis projects properly, and we 
will think about the ideal folder setup for storing data. 


RSTUDIO CLOUD
NOTE: If you are using RStudio Cloud, you need to upload your dataset to
the cloud. Do this in the “Files” tab by clicking on the “Upload” button.


Next, on the RStudio menu, go to File > Import Dataset > From Text (readr). 




Browse through the computer's files and navigate to the downloaded dataset. Click to 
open it. You should see an import dialog box like this: 


[Screenshot: the “Import Text Data” dialog, showing the file path, a data preview and the import options]


Leave all the import settings at the default values; simply click on “Import” at the bottom; 
this should load the dataset into R. You can tell this by looking at your environment pane, 
which should now feature an object called “ebola_sierra_leone” or something similar: 


[Screenshot: the Environment pane, listing ebola_sierra_leone with 200 obs. of 7 variables]


RStudio should also have called the View() function on your dataset, so you should see a 
familiar spreadsheet view of this data: 


[Screenshot: spreadsheet view of the first rows of ebola_sierra_leone, with the columns id, age, sex, status, date_of_onset, date_of_sample and district]


Now take a look at your console. Do you observe that your actions in the graphical user
interface actually triggered some R code to be run? Copy the line of code that includes
the read_csv() function, leaving out the > symbol.


[Screenshot: the console, with the line beginning “ebola_sierra_leone <- read_csv(” marked “Copy this (or something similar)”]


Paste the copied code into your R script, and label this section “Load data”. This may look
something like the below (the file path inside quotes will differ from computer to
computer).


# Load data ----
ebola_sierra_leone <- read_csv("~/Downloads/ebola_sierra_leone.csv")
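
If you want to see the import step work without depending on where your download landed, the sketch below writes a two-row sample file to a temporary location and reads it back in. It is only an illustration: it uses base R's read.csv() rather than read_csv() from {readr} so that it runs without any extra packages, and the names sample_path and ebola_sample are made up. The two data rows are copied from the dataset preview.

```r
# Write a two-row sample file to a temporary location
sample_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,age,sex,status,date_of_onset,date_of_sample,district",
    "92,6,M,confirmed,2014-06-10,2014-06-15,Kailahun",
    "51,46,F,confirmed,2014-05-30,2014-06-04,Kailahun"
  ),
  sample_path
)

# read.csv() is the base R counterpart of readr's read_csv()
ebola_sample <- read.csv(sample_path)
ebola_sample
```

Whichever import function you use, the key point is the same: the file path is passed as a quoted string, and the result is stored in a named object.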


Nice work so far! 


RECAP
Your R script should look similar to this:


# Ebola Sierra Leone analysis
# John Sample-Name Doe
# 2024-01-01

# Load packages ----
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
  tidyverse,
  inspectdf,
  plotly,
  janitor,
  visdat
)

# Load data ----
ebola_sierra_leone <- read_csv("~/Downloads/ebola_sierra_leone.csv")


Intro to reproducibility 


Now that the code for importing data is in your R script, you can easily rerun this script 
anytime to reimport the dataset; there will be no need to redo the manual point-and-click 
procedure for data import. 


Try restarting R and rerunning the script now. Save your script with Control/Command +
S, then restart R with the RStudio Menu, at Session > Restart R. On RStudio Cloud,
the menu option looks like this:


[Screenshot: the Session menu, with the “Restart R” option highlighted]


If restarting is successful, your console should print this message: 


Restarting R session... 


You should also see the phrase “Environment is empty” in the Environment tab, indicating
that the dataset you imported is no longer stored by R; you are starting with a fresh
workspace.


To re-run your script, use Command/Control + A to highlight all the code, then
Command/Control + Enter to run it.


If this worked, congratulations; you have the beginnings of your first “reproducible” 
analysis script! 


What does “reproducible” mean?


VOCAB
When you do things with code rather than by pointing and clicking, it is
easy for anyone to re-run, or reproduce, these steps by simply re-running
your script.


While you can use RStudio’s graphical user interface to point-and-click
your way through the data import process, you should always copy the
relevant code to your script so that your script remains a reproducible
record of all your analysis steps.


Of course, your script so far is not yet entirely reproducible, because the
file path for the dataset (the one that looks like this:
“...intro-to-data-analysis-with-r/ch01_getting_started/data...”) is specific to just your
computer. Later on we will see how to use relative file paths, so that the
code for importing data can work on anyone's computer.


WATCH OUT
If your environment was not empty after restarting R, it means you
skipped a step in a previous lesson. Do this now:


• In the RStudio Menu, go to Tools > Global Options to bring up
RStudio’s options dialog box.


• Then go to General > Basic, and uncheck the box that says
“Restore .RData into workspace at startup”.


• For the option “Save workspace to .RData on exit”, set this to
“Never”.


[Screenshot: RStudio's Options dialog, General > Basic tab, showing the Workspace settings]


Quick data exploration 


Now let's walk through some basic steps of data exploration: taking a broad, bird’s eye
look at the dataset. You should put this section under a heading like “Explore data” in your
script.


To view the top and bottom 6 rows of the dataset, you can use the head() and tail()
functions:


# Explore data ----
head(ebola_sierra_leone)


# A tibble: 6 × 7
     id   age sex   status    date_of_onset date_of_sample
  <dbl> <dbl> <chr> <chr>     <date>        <date>
1    92     6 M     confirmed 2014-06-10    2014-06-15
2    51    46 F     confirmed 2014-05-30    2014-06-04
3   230    NA M     confirmed 2014-06-26    2014-06-30
4   139    25 F     confirmed 2014-06-13    2014-06-18
5     8     8 F     confirmed 2014-05-22    2014-05-27
6   215    49 M     confirmed 2014-06-24    2014-06-29
# ... with 1 more variable: district <chr>


tail(ebola_sierra_leone)


# A tibble: 6 × 7
     id   age sex   status    date_of_onset date_of_sample
  <dbl> <dbl> <chr> <chr>     <date>        <date>
1   214     6 F     confirmed 2014-06-24    2014-06-30
2    28    45 F     confirmed 2014-05-27    2014-06-01
3    12    27 F     confirmed 2014-05-22    2014-05-27
4   110     6 M     confirmed 2014-06-10    2014-06-15
5   209    40 F     confirmed 2014-06-24    2014-06-27
6    35    29 M     suspected 2014-05-28    2014-06-01
# ... with 1 more variable: district <chr>


To view the whole dataset, use the View() function.


View(ebola_sierra_leone)


This will again open a familiar spreadsheet view of the data:


[Screenshot: spreadsheet view of ebola_sierra_leone, with the columns id, age, sex, status, date_of_onset, date_of_sample and district]


You can close this tab and return to your script.


The functions nrow(), ncol() and dim() give you the dimensions of your dataset:


nrow(ebola_sierra_leone) # number of rows


## [1] 200


ncol(ebola_sierra_leone) # number of columns


## [1] 7 


dim(ebola_sierra_leone) # number of rows and columns 


## [1] 200 7 
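
The same three functions work on any data frame, so you can practise on the built-in women dataset as well. A minimal sketch:

```r
nrow(women)  # number of rows
ncol(women)  # number of columns
dim(women)   # rows first, then columns

# dim() returns both numbers at once, so these two should agree
identical(dim(women), c(nrow(women), ncol(women)))
```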


REMINDER
If you're not sure what a function does, remember that you can get
function help with the question mark symbol. For example, to get help on
the ncol() function, run:


?ncol


Another often-helpful function is summary():


summary(ebola_sierra_leone)


##        id              age            sex               status          date_of_onset
##  Min.   :  1.00   Min.   : 1.80   Length:200         Length:200         Min.   :2014-05-18
##  1st Qu.: 62.75   1st Qu.:20.00   Class :character   Class :character   1st Qu.:2014-06-01
##  Median :131.50   Median :35.00   Mode  :character   Mode  :character   Median :2014-06-13
##  Mean   :136.72   Mean   :33.85                                         Mean   :2014-06-12
##  3rd Qu.:208.25   3rd Qu.:45.00                                         3rd Qu.:2014-06-23
##  Max.   :285.00   Max.   :80.00                                         Max.   :2014-06-29
##                   NA's   :4
##  date_of_sample         district
##  Min.   :2014-05-23   Length:200
##  1st Qu.:2014-06-07   Class :character
##  Median :2014-06-18   Mode  :character
##  Mean   :2014-06-17
##  3rd Qu.:2014-06-29
##  Max.   :2014-07-17


As you can see, for numeric columns in your dataset, summary() gives you the minimum
value, the maximum value, the mean, median and the 1st and 3rd quartiles.


For character columns it gives you just the length of the column (the number of rows),
the “class” and the “mode”. We will discuss what “class” and “mode” mean later.
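
summary() also works on a single numeric vector, which makes it easier to see how missing values are reported. The sketch below uses the ages from the first four rows of the dataset (6, 46, missing, 25); the object names ages and age_summary are illustrative.

```r
# Ages from the first four rows of the dataset; one is missing
ages <- c(6, 46, NA, 25)

# summary() reports the usual statistics plus a count of NA's
age_summary <- summary(ages)
age_summary

# mean() returns NA unless missing values are dropped explicitly
mean(ages)
mean(ages, na.rm = TRUE)
```

Notice that summary() quietly ignores the missing value when computing the statistics, while mean() needs na.rm = TRUE to do the same.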


vis_dat() 


The vis_dat() function from the {visdat} package is a wonderful way to quickly visualize
the data types and the missing values in a dataset. Try this now:


vis_dat(ebola_sierra_leone)


[Figure: vis_dat() plot of observations by column, colored by type: character, Date, numeric and NA]


From this figure, you can quickly see the character, date and numeric data types, and you
can note that age is missing for some cases.
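
The two things that vis_dat() shows, column types and missing values, can also be checked numerically with base R. The following is a rough sketch, not a replacement for the plot: toy_data is a made-up four-row data frame built from the first rows of the dataset, and sapply(), colSums() and is.na() are base R functions not yet covered in this course.

```r
# A small data frame built from the first four rows of the dataset
toy_data <- data.frame(
  age = c(6, 46, NA, 25),
  sex = c("M", "F", "M", "F"),
  date_of_onset = as.Date(c("2014-06-10", "2014-05-30", "2014-06-26", "2014-06-13"))
)

# Data type of each column
sapply(toy_data, class)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy_data))
missing_counts
```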


inspect_cat() and inspect_num()


Next, inspect_cat() and inspect_num() from the {inspectdf} package give you visual
summaries of the distribution of variables in the dataset.


If you run inspect_cat() on the data object, you get a tabular summary of the
categorical variables in the dataset, with some information hidden in the levels column
(later you will learn how to extract this information).


inspect_cat(ebola_sierra_leone)


# A tibble: 5 × 5
  col_name         cnt common     common_pcnt levels
  <chr>          <int> <chr>            <dbl> <named list>
1 date_of_onset     39 2014-06-10        10   <tibble>
2 date_of_sample    45 2014-06-15         9.5 <tibble>
3 district           7 Kailahun          77.5 <tibble>
4 sex                2 F                 57   <tibble>
5 status             2 confirmed         91   <tibble>
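
The common and common_pcnt columns are essentially a frequency table boiled down to its top row. As a hedged illustration of where such numbers come from, the sketch below rebuilds a status column from the figures in the table (91% of the 200 cases confirmed, so 182 confirmed and 18 suspected) and tabulates it with base R. The vector status is reconstructed for illustration, not read from the real dataset.

```r
# Rebuild a status column from the reported figures:
# 91% of 200 cases confirmed, the remaining 18 suspected
status <- c(rep("confirmed", 182), rep("suspected", 18))

# Count how often each value occurs
status_counts <- table(status)
status_counts

# Convert the counts to proportions
status_props <- prop.table(status_counts)
status_props
```

The most common level ("confirmed") and its share (0.91, i.e. 91%) match what inspect_cat() reports for the status column.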


But the magic happens when you run show_plot() on the result from inspect_cat():


# store the output of `inspect_cat()` in `cat_summary`
cat_summary <- inspect_cat(ebola_sierra_leone)


# call the `show_plot()` function on that summary.
show_plot(cat_summary)


[Figure: "Frequency of categorical levels in df::ebola_sierra_leone. Gray segments are missing values." One stacked bar per variable: date_of_onset, date_of_sample, district, sex, status.]


You get a wonderful figure showing the distribution of all categorical and date variables! 


SIDE NOTE You could also run:


show_plot(inspect_cat(ebola_sierra_leone))


[Figure: the same plot, "Frequency of categorical levels in df::ebola_sierra_leone. Gray segments are missing values."]


From this plot, you can quickly tell that most cases are in Kailahun, and that there are 
more cases in women than in men (“F” stands for “female”).


One problem is that in this plot, the smaller categories are not labelled. So, for example, 
we are not sure what value is represented by the white section for “status” at the bottom 
right. To see labels on these smaller categories, you can turn this into an interactive plot 
with the ggplotly() function from the {plotly} package. 


cat_summary_plot <- show_plot(cat_summary)
ggplotly(cat_summary_plot)


Wonderful! Now you can hover over each of the bars to see the proportion of each bar 
section. For example, you can now tell that 9% (0.090) of the cases have a suspected
status:


[Figure: hover tooltip on the status bar, reading col_name: status; prop: 0.090; new_level_key: suspected-status-ebola_sierra_leone]




REMINDER The assignment arrow, <-, can be written with the RStudio shortcut alt
+ - (alt AND minus) on Windows or option + - (option AND minus) on
macOS.


You can obtain a similar plot for the numerical (continuous) variables in the dataset with
inspect_num(). Here, we show all three steps in one go.


num_summary <- inspect_num(ebola_sierra_leone)
num_summary_plot <- show_plot(num_summary)
ggplotly(num_summary_plot)


This gives you an overview of the numerical columns, age and id. (Of course, the 
distribution of the id variable is not meaningful.) 


You can tell that individuals aged 35 to 40 (mid-point 37.5) are the largest age group, 
making up 13.8% (0.1377...) of the cases in the dataset. 
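
A proportion like this is simply the share of values that fall into one bin. The sketch below shows the idea in base R on made-up ages. The 5-year bin edges are an assumption chosen for illustration; inspect_num() may pick its bins differently.

```r
# Hypothetical ages (not the Ebola data)
ages <- c(6, 36, 38, 25, 8, 39, 13, 50, 35, 37)

# Assign each age to a 5-year bin: [0,5), [5,10), ..., [75,80)
age_groups <- cut(ages, breaks = seq(0, 80, by = 5), right = FALSE)

# Proportion of values in each bin
group_props <- prop.table(table(age_groups))

# The bin holding the largest share of values
largest_group <- names(which.max(group_props))
print(largest_group)
print(group_props[[largest_group]])
```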


Analyzing a single numeric variable 


Now that you have a sense of what the entire dataset looks like, you can isolate and
analyze single variables at a time; this is called univariate analysis.


Go ahead and create a new section in your script for this univariate analysis. 


# Univariate analysis, numeric variables ----


Let's start by analyzing the numeric age variable.


Extract a column vector with $ 


To extract a single variable/column from a dataset, use the dollar sign operator, $:


ebola_sierra_leone$age # extract the age column in the dataset


[1] 6.0 46.0 NA 25.0 8.0 49.0 13.0 50.0 35.0 38.0 60.0 18.0 10.0
14.0 50.0 35.0 43.0 17.0 3.0
[20] 60.0 38.0 41.0 49.0 12.0 74.0 21.0 27.0 41.0 42.0 60.0 30.0 50.0
50.0 22.0 40.0 35.0 19.0 3.0
[39] 34.0 21.0 73.0 65.0 30.0 70.0 12.0 15.0 42.0 60.0 14.0 40.0 33.0
43.0 45.0 14.0 14.0 40.0 35.0
[58] 30.0 17.0 39.0 20.0 8.0 40.0 42.0 53.0 18.0 40.0 20.0 45.0 40.0
60.0 44.0 33.0 23.0 45.0 7.0
[96] 26.0 37.0 30.0 3.0 56.0 32.0 35.0 54.0 42.0 48.0 11.0 1.8 63.0
55.0 20.0 62.0 62.0 42.0 65.0
[115] 29.0 20.0 33.0 30.0 35.0 NA 50.0 16.0 3.0 22.20 -70 50.0 17.0
40.0 21.0 9.0 27.0 52.0 50.0
[134] 25.0 10.0 30.0 32.0 38.0 30.0 50.0 26.0 35.0 3.0 50.0 60.0 40.0
34.0 4.0 42.0 NA 54.0 18.0
[153] 45.0 30.0 35.0 35.0 16.0 26.0 23.0 45.0 45.0 45.0 38.0 45.0 35.0
30.0 60.0 5.0 18.0 2.0 70.0
[172] 35.0 3.0 30.0 80.0 62.0 20.0 45.0 18.0 28.0 48.0 38.0 39.0 26.0
60.0 35.0 20.0 50.0 11.0 36.0
[191] 29.0 57.0 35.0 26.0 6.0 45.0 27.0 6.0 40.0 29.0


VOCAB This list of values is called a vector in R. A vector is a kind of data
structure that has elements of one type. In this case, the type is
“numeric”. We will formally introduce you to vectors and other data
structures in a future chapter. In this lesson, you can take “vector” and
“variable” to be synonyms.
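
As a quick check of what $ returns, here is a small sketch on a made-up data frame. The names `toy` and `toy_ages` are hypothetical and used only for this illustration.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(id = 1:3, age = c(6, 46, NA))

# Extract the age column with the $ operator
toy_ages <- toy$age

# The result is a plain numeric vector, one element per row
print(is.vector(toy_ages))
print(class(toy_ages))
print(length(toy_ages))

# Double square brackets with the column name extract the same vector
same_result <- identical(toy$age, toy[["age"]])
print(same_result)
```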


Basic operations on a numeric variable 


To get the mean of these ages, you could run: 


mean(ebola_sierra_leone$age)


But it seems we have a problem. R says the mean is NA, which means “not applicable” or
“not available”. This is because there are some missing values in the vector of ages. (Did
you notice this when you printed the vector?) By default, R cannot find the mean if there
are missing values. To ignore these values, use the argument na.rm (which stands for “NA
remove”), setting it to T, or TRUE:


mean(ebola_sierra_leone$age, na.rm = T)


## [1] 33.84592 


Great! This need to remove the NAs before computing a statistic applies to many
functions. The median() function, for example, will also return NA by default if it is called
on a vector with any NAs:


median(ebola_sierra_leone$age) # does not work
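
The effect of na.rm is easiest to see on a very short vector. This sketch uses made-up ages (the `ages` vector is an assumption for illustration):

```r
# Hypothetical ages with one missing value
ages <- c(6, 46, NA, 25, 8)

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(ages)
print(mean_with_na)

# With na.rm = TRUE, the NA is dropped before computing
mean_no_na <- mean(ages, na.rm = TRUE)
median_no_na <- median(ages, na.rm = TRUE)
print(mean_no_na)
print(median_no_na)

# Count how many values are missing
n_missing <- sum(is.na(ages))
print(n_missing)
```

T is a shorthand for TRUE, but T is an ordinary variable that can be overwritten, whereas TRUE is a reserved word. Writing TRUE in full is therefore the safer habit in scripts.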


mean() and median() are just two of many R functions that can be used to inspect a
numerical variable. Let’s look at some others.


But first, we can assign the age vector to a new object, so you don’t have to keep typing
ebola_sierra_leone$age each time.


age_vec <- ebola_sierra_leone$age # assign the vector to the object "age_vec"


Now run these functions on age_vec and observe their outputs: 


sd(age_vec, na.rm = T) # standard deviation
## [1] 17.26864

max(age_vec, na.rm = T) # maximum age
## [1] 80

min(age_vec, na.rm = T) # minimum age
## [1] 1.8

summary(age_vec) # min, max, mean, quartiles and NAs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.80   20.00   35.00   33.85   45.00   80.00

length(age_vec) # number of elements in the vector
## [1] 200

sum(age_vec, na.rm = T) # sum of all elements in the vector
## [1] 6633.8


Do not feel intimidated by the long list of functions! You should not have to memorize 
them; rather you should feel free to Google the function for whatever operation you want 
to carry out. You might search something like “what is the function for standard deviation 
in R”. One of the first results should lead you to what you need. 


Visualizing a numeric variable 


Now let's create a graph to visualize the age variable. The two most common graphics for 
inspecting the distribution of numerical variables are histograms (like the output of the 
inspect_num() function you saw earlier) and boxplots.


R has built-in functions for these: 


hist(age_vec)


[Figure: "Histogram of age_vec". x axis: age_vec; y axis: Frequency.]


boxplot(age_vec)


Nice and easy! 
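
If you are curious about the numbers behind these two plots, the sketch below extracts them without drawing anything. The `ages` vector and the 20-year bin edges are made up for illustration.

```r
# Hypothetical ages (not the Ebola data)
ages <- c(6, 46, 25, 8, 49, 13, 50, 35, 38, 60)

# hist() with plot = FALSE returns the bins and counts instead of drawing
h <- hist(ages, breaks = seq(0, 60, by = 20), plot = FALSE)
print(h$breaks)
print(h$counts)

# boxplot.stats() returns the five numbers a boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box_numbers <- boxplot.stats(ages)$stats
print(box_numbers)
```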


Graphical functions like boxplot() and hist() are part of R's base graphics package. These 
functions are quick and easy to use, but they do not offer a lot of flexibility, and it is 
difficult to make beautiful plots with them. So most people in the R community use an 
extension package, {ggplot2}, for their data visualization. 


In this course, we'll use {ggplot2} indirectly, through the {esquisse} package, which provides
a user-friendly interface for creating ggplot2 plots.


The workhorse function of the {esquisse} package is esquisser(), and this function
takes a single argument: the dataset you want to visualize. So we can run:


esquisser(ebola_sierra_leone)


This should bring up a graphical user interface that you can use to plot different variables. To
visualize the age variable, simply drag age from the list of variables into the x axis box:


[Figure: the esquisse interface, with the age variable being dragged into the x axis box]


When age is in the x axis box, you should automatically get a histogram of ages: 


[Figure: histogram of age shown in the esquisse interface]


You can change the plot type by clicking on the “Histogram” button and selecting one of 
the other valid plot types. Try out the boxplot, violin plot and density plot and observe the 
outputs. 


[Figure: the esquisse plot type menu, showing Histogram, Boxplot, Violin and Density]


When you are done creating a plot with {esquisse}, you should copy the code that was 
created by clicking on the “Code” button at the bottom right then “Copy to clipboard”: 


[Figure: the esquisse “Code” panel, showing the generated ggplot code with the “Copy to clipboard” and “Insert code in script” options]


Now, paste that code into your script, and make sure you can run it from there. The code 
should look something like this: 


ggplot(ebola_sierra_leone) +
  aes(x = age) +
  geom_histogram(bins = 30L, fill = "#112446") +
  theme_minimal()


By copying the generated code into your script, you ensure that the data visualization you 
created is fully reproducible. 


PRO TIP {esquisse} can only create fairly simple graphics, so when you want to
make highly customized or complex plots, you will need to learn how to
write {ggplot2} code manually. This will be the focus of a later course.


You should also test out the other tabs on the bottom toolbar to see what they do: Labels 
& Title, Plot options, Appearance and Data. 

Easy bivariate and multivariate plots


CHALLENGE In this lesson we are focusing on univariate analysis: exploring and
visualizing one variable at a time. But with esquisse, it is so easy to make
a bivariate or multivariate plot that you can already get your feet wet with
this.


Try the following plots:


• Drag age to the X box and sex to the Y box.

• Drag age to the X box, sex to the Y box, and sex to the fill box.

• Drag age to the X box and district to the Y box.


Analyzing a single categorical variable 


Next, let’s look at a categorical variable, the districts of reported cases: 


# Univariate analysis, categorical variables ----
ebola_sierra_leone$district


## [1] "Kailahun" "Kailahun" "Kenema" "Kailahun"
## "Kailahun" "Kailahun"
## [7] "Kailahun" "Kailahun" "Kenema" "Kailahun"
## "Kailahun" "Kailahun"
## [13] "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## "Kailahun" "Kenema"
## [19] "Kono" "Kailahun" "Kailahun" "Kailahun"
## "Kenema" "Kailahun"
## [25] "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## "Kenema" "Kenema"
## [31] "Kenema" "Kailahun" "Kailahun" "Bo"
## "Kailahun" "Kailahun"
## [37] "Kailahun" "Kenema" "Kenema" "Kenema"
## "Kailahun" "Kailahun"
## [43] "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## "Western Urban" "Kailahun"
## [49] "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## "Kailahun" "Kailahun"
## [55] "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## "Kailahun" "Kailahun"
## [61] "Kailahun" "Kenema" "Western Urban" "Kambia"
## "Kailahun" "Kailahun"
## [67] "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## "Kailahun" "Kailahun"
## [73] "Kenema" "Kailahun" "Kailahun" "Kenema"
## "Kailahun" "Kailahun"
## [79] "Kenema" "Kailahun" "Kailahun" "Kailahun"
## "Kailahun" "Kailahun"
## [85] "Kailahun" "Kailahun" "Kailahun" "Kailahun"
## "Kailahun" "Kenema"
## [91] "Kailahun" "Kailahun" "Kailahun" "Kono"
## "Port Loko" "Kenema"
[Output truncated: the remaining values of the 200-element vector are not reproduced here.]


Sorry for printing that very long vector! 


Frequency tables 


You can use the table() function to create a frequency table of a categorical variable: 


table(ebola_sierra_leone$district)


##
##            Bo      Kailahun        Kambia        Kenema          Kono
##             2           155             1            34             2
##     Port Loko Western Urban
##             2             4


You can see that most cases are in Kailahun and Kenema.


table() is a useful “base” function. But there is a better function for creating frequency
tables, called tabyl(), from the {janitor} package.


To use it, you supply the name of your data frame as the first argument, then the name of
the variable to be tabulated:


tabyl(ebola_sierra_leone, district)


##       district   n percent
##             Bo   2   0.010
##       Kailahun 155   0.775
##         Kambia   1   0.005
##         Kenema  34   0.170
##           Kono   2   0.010
##      Port Loko   2   0.010
##  Western Urban   4   0.020


As you can see, tabyl() gives you both the counts and the proportions of
each value. It also has some other attractive features you will see later.
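
To make the relationship between counts and proportions concrete, here is a rough base-R sketch that builds a similar table by hand. It does not use {janitor}; the `districts` vector is made up, and the column names simply mirror the tabyl() output above.

```r
# Hypothetical district values (not the Ebola data)
districts <- c("Kailahun", "Kailahun", "Kenema", "Kailahun", "Bo")

# Counts per category
counts <- table(districts)

# Proportions: each count divided by the total
props <- prop.table(counts)

# Combine into a data frame with one row per category
freq_table <- data.frame(
  district = names(counts),
  n = as.vector(counts),
  percent = as.vector(props)
)
print(freq_table)
```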


PRO TIP You can also easily make cross-tabulations with tabyl(). Simply add
additional variables separated by a comma. For example, to create a
cross-tabulation by district and sex, run:


tabyl(ebola_sierra_leone, district, sex)


##       district  F  M
##             Bo  0  2
##       Kailahun 91 64
##         Kambia  0  1
##         Kenema 20 14
##           Kono  0  2
##      Port Loko  1  1
##  Western Urban  2  2


The output shows us that there were 0 women in the Bo district, 2 men in
the Bo district, 91 women in the Kailahun district, and so on.
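
For comparison, base R's table() can also cross-tabulate two variables. This is a sketch on made-up data (the `toy` data frame is hypothetical), not a replacement for tabyl():

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  district = c("Kailahun", "Kailahun", "Kenema", "Bo", "Kailahun"),
  sex = c("F", "M", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# Rows are districts, columns are sexes
cross_tab <- table(toy$district, toy$sex)
print(cross_tab)

# Look up a single cell by row and column name
kailahun_women <- cross_tab["Kailahun", "F"]
print(kailahun_women)

# Add row and column totals
with_totals <- addmargins(cross_tab)
print(with_totals)
```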


Visualizing a categorical variable 


Now, let’s try to visualize the district variable. As before, the best way to do this is with
the esquisser() function from {esquisse}. Run this code again:


esquisser(ebola_sierra_leone)


Then drag the district variable to the X axis box:


[Figure: the esquisse interface, with the district variable being dragged to the X axis box]


You should get a bar chart showing the count of individuals across districts. Copy the 
generated code and paste it into your script. 


Answering questions about the outbreak 


With the functions you have just learned, you have the tools to answer the questions 
about the Ebola outbreak that were listed at the top. Give it a go. Attempt these 
questions on your own, then look at the solutions below. 


• When was the first case reported? (Hint: look at the date of sample)
• As of the end of June 2014, which 10-year age group had had the most cases?
• What was the median age of those affected?
• Had there been more cases in men or women?
• What district had had the most reported cases?
• By the end of June 2014, was the outbreak growing or receding?


Solutions


• When was the first case reported?


min(ebola_sierra_leone$date_of_sample)


## [1] "2014-05-23" 


We don't have the date of report, but the first “date_of_sample” (when the Ebola test 
sample was taken from the patient) is May 23rd. We can use this as a proxy for the date 
of first report. 
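
min() works on dates because R stores them in the Date class, which has a natural ordering. A small sketch on made-up dates (the `sample_dates` vector is an assumption for illustration):

```r
# Hypothetical sample dates, including one missing value
sample_dates <- as.Date(c("2014-06-10", "2014-05-23", NA, "2014-07-17"))

# As with mean(), a missing value makes the result NA by default
earliest_with_na <- min(sample_dates)
print(earliest_with_na)

# Drop missing values to get the earliest and latest dates
first_sample <- min(sample_dates, na.rm = TRUE)
last_sample <- max(sample_dates, na.rm = TRUE)
print(first_sample)
print(last_sample)
```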


• What was the median age of cases?


median(ebola_sierra_leone$age, na.rm = T)


The median age of cases was 35. 


• Are there more cases in men or women?


tabyl(ebola_sierra_leone$sex)


## ebola_sierra_leone$sex   n percent
##                      F 114    0.57
##                      M  86    0.43


As seen in the table, there were more cases in women. Specifically, 57% of cases were in
women.


• What district has had the most reported cases?


tabyl(ebola_sierra_leone$district)


## ebola_sierra_leone$district   n percent
##                          Bo   2   0.010
##                    Kailahun 155   0.775
##                      Kambia   1   0.005
##                      Kenema  34   0.170
##                        Kono   2   0.010
##                   Port Loko   2   0.010
##               Western Urban   4   0.020


# We can also plot the following chart (generated with esquisse)
ggplot(ebola_sierra_leone) +
  aes(x = district) +
  geom_bar(fill = "#112446") +
  theme_minimal()


[Figure: bar chart of count by district. x axis: district (Bo, Kailahun, Kambia, Kenema, Kono, Port Loko, Western Urban); y axis: count.]


As seen, the Kailahun district had the majority of cases.


• By the end of June 2014, was the outbreak growing or receding?


For this, we can use esquisse to generate a bar chart that shows a count of cases on each
day. Simply drag the date_of_onset variable to the x axis. The output code from
esquisse should resemble the below:


ggplot(ebola_sierra_leone) +
  aes(x = date_of_onset) +
  geom_histogram(bins = 30L, fill = "#112446") +
  theme_minimal()


[Figure: count of cases by date_of_onset. x axis: date_of_onset, from mid-May to early July 2014.]


Great! But it is debatable whether the outbreak was growing or receding at the end of 
June 2014; a precise trend is not really clear! 
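
One way to make a trend easier to judge is to count cases per week rather than per day. The sketch below does this in base R on made-up onset dates. The `onset_dates` vector is hypothetical, and weekly grouping is just one reasonable choice, not the method used in this lesson.

```r
# Hypothetical onset dates (not the Ebola data)
onset_dates <- as.Date(c(
  "2014-06-02", "2014-06-03",
  "2014-06-10", "2014-06-11", "2014-06-12",
  "2014-06-20"
))

# Label each date with the Monday that starts its week
week_start <- cut(onset_dates, breaks = "week")

# Count the cases in each week
weekly_counts <- table(week_start)
print(weekly_counts)
```

Comparing the last few weekly counts gives a rough sense of direction, although with small numbers the pattern can still be ambiguous.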


Haven't had enough? 


If you would like to practice some of the methods and functions you learned on a similar
dataset, try downloading the data that is stored on this page:
https://bit.ly/view-yaounde-covid-data


That dataset is in the form of an Excel spreadsheet, so when you are importing the 
dataset with RStudio, you should use the “From Excel” option (File > Import Dataset > 
From Excel). 


This dataset contains the results of a COVID-19 serological survey conducted in Yaounde, 
Cameroon in late 2020. The survey estimated how many people had been infected with 
COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be 
obtained from here: go.nature.com/3R866wx 


Wrapping up 


Congratulations! You have now taken your first baby steps in analyzing data with R: you 
imported a dataset, explored its structure, performed basic univariate analysis and 
visualization on its numeric and categorical variables, and you were able to answer 
important questions about the outbreak based on this. 


Of course, this was only a sneak peek of the data analysis process—a lot was left out. 
Hopefully, though, this sneak peek has gotten you a bit excited about what you can do 
with R. And hopefully, you can already start to apply some of these to your own datasets. 
The journey is only beginning! See you soon. 


Contributors 


The following team members contributed to this lesson: 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 




References 


Some material in this lesson was adapted from the following sources: 


• Barnier, Julien. “Introduction à R et au tidyverse.” Partie 13 Diffuser et publier avec
rmarkdown, May 24, 2022. https://juba.github.io/tidyverse/13-rmarkdown.html.


• Yihui Xie, J. J. Allaire, and Garrett Grolemund. “R Markdown: The Definitive Guide.”
Home, April 11, 2022. https://bookdown.org/yihui/rmarkdown/.


This work is licensed under the Creative Commons Attribution Share Alike license. 




Lesson notes | RStudio projects 




Getting started: RStudio projects
Learning objectives
Introduction
Creating a new RStudio Project
Creating Project subfolders
Adding a dataset to the “data” folder
Creating a script in the “scripts” folder
Importing data from the “data” folder
Exporting data to the “outputs” folder
Exporting plots to the “outputs” folder
Sharing a Project
Wrapping up


Getting started: RStudio projects 


Learning objectives 


1. You can set up an RStudio Project and create sub-directories for input data, scripts 
and analytic outputs. 


2. You can import and export data within an RStudio Project. 
3. You understand the difference between relative and absolute file paths. 


4. You recognize the value of Projects for organizing and sharing your analyses. 


Introduction 


Previously, you walked through some of the essential steps of data analysis, from 
importing data to calculating basic statistics. But you skipped over one crucial step: 
setting up a data analysis project. 


Experienced data analysts keep all the files associated with a specific analysis—input data, 
R scripts and analytic outputs—together in a single folder. These folders are called 
projects (small p), and RStudio has built-in support for them via RStudio Projects (capital 
P). 


In this lesson you will learn how to use these RStudio Projects to organize your data 
analysis coherently, and improve the reproducibility of your work. You will replicate some 
of the analysis you did in the last data dive lesson, but in the context of an RStudio 
Project. 


Let's get started. 


Creating a new RStudio Project 


Creating a new RStudio Project looks different if you are on a local computer and if you 
are on RStudio Cloud. Jump to the section that is relevant for you. 


On RStudio Cloud 


If you are using RStudio Cloud, you have probably already created a project, because you
can't do any analysis without projects.


The steps are pretty simple: go to your Cloud homepage, rstudio.cloud, and click on the 
“New Project” button. 


[Figure: the “Your Projects” page of the RStudio Cloud workspace]


Name your Project something like ebola_analysis or ebola_analysis_proj if you
already have a project named ebola_analysis.


[Figure: the ebola_analysis project open in RStudio Cloud]


The RStudio Project you have now created is just a folder on a virtual computer, which has
a .Rproj file within it (and maybe a .Rhistory file). You should be able to see this .Rproj file
in the Files pane of RStudio:


[Figure: the RStudio Files pane, showing ebola_analysis.Rproj]


KEY POINT The .Rproj file is what turns a regular computer folder into an “RStudio
Project”.


On a local computer 


If you are on a local computer, open RStudio, then on the RStudio menu, go to File > 
New Project. Your options may look a little different from the screenshots below 
depending on your operating system. 


[Figure: the File > New Project... menu option]


Choose “New Directory”:


[Figure: the New Project dialog, with the options New Directory, Existing Directory and Version Control]


Then choose “New Project”:


[Figure: the Project Type step of the New Project Wizard, listing New Project, R Package and Shiny Application]


You can call your Project something like “ebola_analysis” and make it a “subdirectory” of a
folder that is easy to find, such as your desktop. (The phrase “Create project as
subdirectory of” sounds scary, but it’s not; RStudio is simply asking: “where should I put
the project folder?”)


[Figure: the Create New Project dialog, with directory name ebola_analysis, created as a subdirectory of ~/Desktop]


The RStudio Project you have created is just a folder with a .Rproj file within it (and maybe
a .Rhistory file). You should be able to see this .Rproj file in the Files pane of RStudio:


[Figure: the RStudio Files pane, showing ebola_analysis.Rproj. Click on the .Rproj file to open your project.]


KEY POINT The .Rproj file is what turns a regular computer folder into an “RStudio
Project”. From now on, to open your project, you should double click on this .Rproj
file from your computer's Finder/File Explorer.


On Windows, here is an example of what a .Rproj file will look like from
the File Explorer:


[Figure: File Explorer listing of the intro-to-data-analysis-with-r folder]


On macOS, here is an example of what a .Rproj file will look like from
Finder:


[Figure: Finder listing of the intro-to-data-analysis-with-r folder, including intro-to-data-analysis-with-r.Rproj]


Note also that there is a header at the top right of the RStudio window that tells you which
Project you currently have open. Clicking on this gives you some additional Project
options. You can create a new project, close a project and open recent projects, among
other options.


[Figure: the Project menu, with the options New Project..., Open Project..., Open Project in New Session... and Close Project]


Creating Project subfolders


Data analysis projects usually have at least three sub-folders: one for data, another for 
scripts, and a third for outputs, as seen below: 


Your Project Name
    .Rproj (R Project File)
    data
    scripts
    outputs


Let’s look at the sub-folders one by one: 


• data: This contains the source (raw) data files that you will use in the analysis.
These could be CSV or Excel files, for example.


• scripts: This sub-folder is where you keep your R scripts. You can also save
RMarkdown files in this folder. (You will learn about RMarkdown files soon.)


• outputs: Here, you save the outputs of your analysis, like plots and summary tables.
These outputs should be disposable and reproducible. That is, you should be able to
regenerate the outputs by running the code in your scripts. You will understand this
better soon.


Now go ahead and create these three sub-folders, “data”, “scripts” and “outputs”, within
your RStudio Project folder. You should use the “New Folder” button on the RStudio Files
pane to do this:


[Figure: the “New Folder” button on the RStudio Files pane]
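
The same folders can also be created from code. This is a sketch only: it builds a throwaway project folder inside R's temporary directory so that nothing on your computer is touched, and the name `ebola_analysis_demo` is made up for the example.

```r
# Throwaway project folder inside the session's temporary directory
project_dir <- file.path(tempdir(), "ebola_analysis_demo")
dir.create(project_dir, showWarnings = FALSE)

# Create the three standard sub-folders
for (sub_folder in c("data", "scripts", "outputs")) {
  dir.create(file.path(project_dir, sub_folder), showWarnings = FALSE)
}

# List what was created
created <- list.files(project_dir)
print(created)
```

In a real Project you would replace `project_dir` with the path to your own Project folder.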


Adding a dataset to the “data” folder 


Next, you should move the Ebola dataset you downloaded in the previous lesson to the 
newly-created “data” sub-folder (you can re-download that dataset at bit.ly/ebola-data if 
you can't find where you stored it). 


The procedure for moving this dataset to the “data” folder is different for RStudio Cloud 
users and those using a local computer. Jump to the section that is relevant for you. 


On RStudio Cloud 


If you are on RStudio Cloud, adding the dataset to your “data” folder is straightforward.
Simply navigate to the folder within the Files pane, then click the “Upload” button:


[Figure: the Files pane on RStudio Cloud, inside the “data” folder, with the “Upload” button highlighted]


This will bring up a dialog box where you can select the file for upload. 


On a local computer 


On a local computer, this step has to be done with your computer's File Explorer/Finder. 


• First, locate the Project folder with your computer's File Explorer/Finder. If you're
having trouble locating this, RStudio can help: go to the “Files” tab, click on “More”
(the gear icon), then click “Show Folder in New Window”.


[Figure: the “More” menu of the Files pane, with “Show Folder in New Window” among its options]


This will bring you to the Project folder in your computer's File Explorer/Finder.


• Now, move the Ebola dataset you downloaded in the previous lesson to the newly-created
“data” sub-folder.


Here is what moving the file might look like on macOS:


[Figure: dragging ebola_sierra_leone.csv from the Downloads folder into the data folder of ebola_analysis in Finder]


Creating a script in the “scripts” folder 


Next, create and save a new R script within the “scripts” folder. You can call this
“main_analysis” or something similar. To create a new R script within a folder, first
navigate to that folder in the Files pane, then click the “New Blank File” button and select
“R script” in the dropdown:


[Figure: the Files pane. 1. Navigate to “scripts”. 2. Click “New Blank File” for a new script.]


SIDE NOTE Note that this is different from what you have done so far when creating 
a new script (before, you used the menu option, File > New File > 
New Script). The old way is still valid; but this “New Blank File” button 
will probably be faster for you. 


Great work so far! Now your Project folder should have the structure shown below, with
the “ebola_sierra_leone.csv” dataset in the “data” folder and the “main_analysis.R” script
(still empty) in the “scripts” folder:


data
    ebola_sierra_leone.csv
scripts
    main_analysis.R
outputs


This is a process you should go through at the start of every data analysis project: set up 
an RStudio Project, create the needed sub-folders, and put your datasets and scripts in 
the appropriate sub-folders. It can be a bit painful, but it will pay off in the long run. 


The rest of this lesson will teach you how to conduct your analysis in the context of this 
folder setup. At the end, you will have an overall flow of data and outputs that resembles 
the diagram below: 


[Figure: Data flow in an R project. Scripts in the “scripts” folder (main_analysis.R) import raw data from the “data” folder (ebola_sierra_leone.csv) and write data and plots to the “outputs” folder (categorical_plot.png, numeric_plot.png, district_table.csv).]


You should refer back to this diagram as you proceed through the sections below to help
orient yourself.


Importing data from the “data” folder 


We will use the code snippet below to demonstrate the flow of data through a Project. 
Copy and paste this snippet into your “main_analysis.R” script (but don’t run it yet). The 


code replicates parts of the analysis from the data dive lesson. 


# Ebola Sierra Leone analysis 
# John Sample-Name Doe 
# 2024-01-01 


# Load packages ---- 
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
  tidyverse,
  janitor,
  inspectdf,
  here # new package we will use soon
)

# Load data ----
ebola_sierra_leone <- read_csv("") # DATA PENDING! WE WILL UPDATE THIS BELOW.

# Cases by district ----
district_tab <- tabyl(ebola_sierra_leone, district)
district_tab

# Visualize categorical variables ----
categ_vars_plot <- show_plot(inspect_cat(ebola_sierra_leone))
categ_vars_plot

# Visualize numeric variables ----
num_vars_plot <- show_plot(inspect_num(ebola_sierra_leone))
num_vars_plot


First run the “Load packages” section to install and/or load any needed packages. 


Then proceed to the “Load data” section, which looks like this: 


# Load data ---- 
ebola_sierra_leone <- read_csv("") # DATA PENDING! WE WILL UPDATE THIS BELOW.


Here you want to import the Ebola dataset that you previously placed inside the Project's 
“data” folder. To do this, you need to supply the file path of that dataset as the first 


argument of read_csv(). 


Because you are using an RStudio Project, this path can be obtained very easily: place 
your cursor inside the quotation marks within the read_csv() function, and press the 
Tab key on your keyboard. You should see a list of the sub-folders available in your Project. 
Something like this: 


[Screenshot: RStudio path autocompletion inside read_csv(""), listing the Project's sub-folders (including “outputs” and “scripts”) and the .Rproj file.]


Click on the “data” folder, then press Tab again. Since you only have one file in the “data”
folder, RStudio should automatically fill in its name. You should now see:

ebola_sierra_leone <- read_csv("data/ebola_sierra_leone.csv")


Wonderful! Run this line of code now to import the data. 


If this is successful, you should see the data appear in the Environment tab of RStudio: 


[Screenshot: Environment tab (Global Environment) listing ebola_sierra_leone under Data, with 200 obs. of 7 variables.]


Relative paths 


KEY POINT The path you have used here, “data/ebola_sierra_leone.csv”, is called a
relative path, because it is relative to the root (or the base) of your
Project.

How does R know where the root of your Project is? That’s where the
.Rproj file comes in. This file, which lives in the “ebola_analysis” folder, tells
R “Here! Here! I am in the ‘ebola_analysis’ folder, so this must be the
root!”. Thus, you only need to specify path components that are deeper
than this root.


RStudio Projects, and the relative paths they allow you to use, are 
important for reproducibility. Projects that use relative paths can be run 
on anyone's computer, and the importing and exporting code should 
work without any hiccups. This means that you can send someone an 
RStudio Project folder and the code should run on their machine just as it 
ran on yours! 


This would not be the case if you were to use an absolute path,
something like
“~/Desktop/my_data_analysis/learning_r/ebola_sierra_leone.csv”, in your
script. Absolute paths give the full address of a file, and will not usually
work on someone else’s computer, where files and folders will be
arranged differently.
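To make the difference between the two kinds of path concrete, here is a small base R sketch. It is an illustration under assumptions: `project_root` is a hypothetical stand-in for your Project folder (placed in R's temporary directory), and `setwd()` is used only to mimic what opening an RStudio Project does for you.

```r
# Sketch: how a relative path is resolved against a root folder.
# `project_root` is a hypothetical stand-in for your Project folder.
project_root <- file.path(tempdir(), "path_demo")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines("id,age\n1,20", file.path(project_root, "data", "demo.csv"))

relative_path <- "data/demo.csv"                         # only the part below the root
absolute_path <- file.path(project_root, relative_path)  # full address on THIS computer

# Temporarily make the root the working directory, as opening a Project does
old_wd <- setwd(project_root)
found_by_relative_path <- file.exists(relative_path)
setwd(old_wd) # restore the previous working directory

found_by_relative_path
absolute_path
```

The relative path would stay the same on a colleague's machine, while the printed absolute path is specific to your computer.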


RSTUDIO CLOUD Note that if you are using RStudio Cloud, you are forced to use relative
paths, because you cannot access the general file system of the virtual
computer; you can only work within specific Project folders.


Using here::here()


As you have now seen, RStudio Projects simplify the data import process and improve the 
reproducibility of your analysis, primarily because they allow you to use relative paths. 


But there is one more step we recommend when using relative paths: rather than leave
your path naked, wrap it in the here() function from the {here} package.

So, in the data import section of your script, change read_csv()’s input from
"data/ebola_sierra_leone.csv" to here("data/ebola_sierra_leone.csv"):

ebola_sierra_leone <- read_csv(here("data/ebola_sierra_leone.csv"))


What is the point of wrapping the path in here()? Well, technically, there is no real point in
doing this in an R script; the importing code works fine without it. But it will be necessary
when you start using RMarkdown scripts (which you will soon be introduced to), because
paths not wrapped in here() are problematic in the RMarkdown context.

So to keep things consistent, we always recommend you use here() when pointing to
paths, whether in an R script or an RMarkdown script.


Exporting data to the “outputs” folder 


Importing data is not the only benefit of RStudio Projects; data export is also streamlined 
when you use Projects. Let's look at this now. 


In the “Cases by district” section of your script, you should have: 


# Cases by district ----
district_tab <- tabyl(ebola_sierra_leone, district)
district_tab


Run this code now; you should get the following tabular output: 


district        n  percent
Bo              2    0.010
Kailahun      155    0.775
Kambia          1    0.005
Kenema         34    0.170
Kono            2    0.010
Port Loko       2    0.010
Western Urban   4    0.020


Now, imagine that you want to export this table as a CSV. It would be nice if there was a 
specific folder designated for such exports. Well, there is! It’s the “outputs” folder you 
created earlier. Let’s export your table there now. Type out the code below (but don’t run 
it yet): 


write_csv(x = district_tab, file = "")

With the write_csv() function, you are going to “write” (or “save”) the district_tab
table as a CSV file.

The x argument of write_csv() takes in the object to be saved (in this case
district_tab). And the file argument takes in the target file path. This target file path
can be a simple relative path: “outputs/district_table.csv”. (And, as mentioned before, we
should wrap the path in here().) Type this up and run it now:

write_csv(x = district_tab, file = here("outputs/district_table.csv"))

The path “outputs/district_table.csv” tells write_csv() to save the table as a CSV file
named “district_table” in the “outputs” folder of the Project.


SIDE NOTE You can replace “district_table.csv” with any other appropriate name, for
example “freq_table_across_districts.csv”:

write_csv(x = district_tab, file = here("outputs/freq_table_across_districts.csv"))


Great work! Now, if you go to the Files tab and navigate to the outputs folder of your 
Project, you should see this newly created file: 


[Screenshot: RStudio Files pane showing the newly created file (132 B) in the outputs folder.]


You can click on the file to view it within RStudio as a raw CSV: 


[Screenshot: clicking on district_table (132 B) in the Files pane.]


This should bring up an RStudio viewer window: 


[Screenshot: RStudio viewer tab for district_table.csv, showing the raw CSV:]

district,n,percent
Bo,2,0.01
Kailahun,155,0.775
Kambia,1,0.005
Kenema,34,0.17
Kono,2,0.01
Port Loko,2,0.01
Western Urban,4,0.02


If you instead want to view the CSV in Microsoft Excel, you can navigate to the same file in 
your computer's Finder/File Explorer and double-click on it from there. 


REMINDER To locate your Project folder in your computer's Finder/File Explorer, go to
the “Files” tab, click on the gear icon, then click “Show Folder in New
Window”.

[Screenshot: the gear (“More”) menu in the Files tab, which includes the “Show Folder in New Window” option.]


RSTUDIO CLOUD If you are on RStudio Cloud, then you won't be able to view the CSV in
Microsoft Excel until you have “exported” it. Use the “Export” menu
option in the Files tab. If this is not immediately visible, click on the gear
icon to bring up “More” options, then scroll through to find the “Export”
option.


Overwriting data 


If you need to update the output CSV, you can simply rerun the write_csv() function
with the updated data object.

To test this, replace the “Cases by district” section of your script with the following code.
It uses the arrange() function to arrange the table in order of the number of cases, n:

# Cases by district ----
district_tab <- tabyl(ebola_sierra_leone, district)
district_tab_arranged <- arrange(district_tab, -n)
district_tab_arranged

(-n means “sort in descending order of the n variable”; we will introduce you to the
arrange() function properly later on.)


The output should be: 


district n percent 
Kailahun 155 0.775 
Kenema 34 0.170 
Western Urban 4 0.020 
Bo 2 0.010 

Kono 2 0.010 

Port Loko 2 0.010 
Kambia 1 0.005 


You can now overwrite the old “district_table.csv” file by re-running the write_csv() function
with the district_tab_arranged object:

write_csv(x = district_tab_arranged, file = here("outputs/district_table.csv"))


To verify that the dataset was actually updated, observe the “Modified” time stamp in the 
RStudio Files pane: 


[Screenshot: RStudio Files pane, ebola_analysis > outputs folder: district_table.csv, 132 B, modified May 30, 2022, 9:29 PM.]
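The same check can be made from code instead of the Files pane. The sketch below is illustrative only: it uses base R's write.csv() as a stand-in for write_csv(), a temporary file instead of your “outputs” folder, and a made-up two-row table (`district_demo`), all of which are assumptions for the sake of a self-contained example.

```r
# Sketch: confirm that re-writing a file replaces its contents.
# write.csv() (base R) stands in for write_csv(); out_path is a temp file.
out_path <- file.path(tempdir(), "district_table_demo.csv")
district_demo <- data.frame(
  district = c("Bo", "Kailahun"),
  n = c(2, 155),
  stringsAsFactors = FALSE
)

write.csv(district_demo, out_path, row.names = FALSE)
first_write <- file.info(out_path)$mtime  # "Modified" time stamp after first save

# Overwrite the same path with a table sorted by descending n
district_demo_arranged <- district_demo[order(-district_demo$n), ]
write.csv(district_demo_arranged, out_path, row.names = FALSE)
second_write <- file.info(out_path)$mtime

second_write >= first_write  # the time stamp never goes backwards
first_district <- read.csv(out_path, stringsAsFactors = FALSE)$district[1]
first_district               # first row of the overwritten file
```

Reading the file back, as in the last two lines, is a more direct check than the time stamp, because it shows the new row order.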


Exporting plots to the “outputs” folder 
Finally, let’s look at plot exporting in the context of an RStudio Project. 


In the “Visualize categorical variables” section of your script, you should have: 


# Visualize categorical variables ----
categ_vars_plot <- show_plot(inspect_cat(ebola_sierra_leone))
categ_vars_plot


Running these code lines should give you this output: 


[Plot: “Frequency of categorical levels in df::ebola_sierra_leone” (gray segments are missing values), with one bar each for date_of_onset, date_of_sample, district and status.]


Below these lines, type up the ggsave() command below (but don't run it yet):

ggsave(filename = "", plot = categ_vars_plot)

This command uses the ggsave() function to export the categ_vars_plot figure. The
plot argument of ggsave() takes in the object to be saved (in this case
categ_vars_plot), and the filename argument takes in the target file path for the
plot.


As you saw when exporting data, this target file path is quite simple because you are 
working in an RStudio Project. In this case, you have: 


ggsave(filename = "outputs/categorical_plot.png", plot = categ_vars_plot)

Run this ggsave() command now. The path “outputs/categorical_plot.png” tells
ggsave() to save the plot as a PNG file named “categorical_plot” in the “outputs” folder
of the Project.


To see this newly-saved plot, navigate to the Files tab. You can click on it to open it with 
your computer's default image viewer: 


[Screenshot: RStudio Files pane, ebola_sierra_leone > outputs folder, showing the saved plot (176.4 KB) and district_table.csv (132 B).]


Also note that the ggsave() function lets you save plots to multiple image formats.
For example, you could instead write:

ggsave(filename = "outputs/categorical_plot.pdf", plot = categ_vars_plot)

to save the plot as a PDF. Run ?ggsave to see what other formats are possible.


Now let’s export the second plot, the numerical summary. In the section of your script 
called “Visualize numeric variables”, you should have: 


# Visualize numeric variables ----
num_vars_plot <- show_plot(inspect_num(ebola_sierra_leone))
num_vars_plot


Running these code lines should give you this output: 


[Plot: “Histograms of numeric columns in df::ebola_sierra_leone”, with one panel each for age and id.]


To export this plot, type up and run the following code: 


ggsave(filename = "outputs/numeric_plot.png", plot = num_vars_plot)

Wonderful! 


Sharing a Project 

Projects are also great for sharing your analysis with collaborators. 

You can zip up your Project folder and send it to a colleague through email or through a 
file sharing service like Dropbox. The colleague can then unzip the folder, click on the 


.Rproj file to open the Project in RStudio, and re-do and edit all your analysis steps. 


This is a decent setup, but sending projects back and forth may not be ideal for long-term
collaboration. So experienced analysts use a technology called git to collaborate on
projects. But this topic is a bit too advanced for this course; we will cover it in detail in a
future course. If you are impatient, you can check out this book chapter:
https://intro2r.com/github_r.html


Wrapping up 

Congratulations! You now know how to set up and use RStudio Projects! 

Hopefully you see the value of organizing your analysis scripts, data and outputs in this 

way. Projects are a coherent way to structure your analyses, and make it easy to revisit, 
revise and share your work. They will be the foundation for much of your work as a data 
analyst going forward. 


That's it for now. See you in the next lesson. 


Contributors 
The following team members contributed to this lesson: 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


References 
Some material in this lesson was adapted from the following sources: 
• Wickham, H., & Grolemund, G. (n.d.). R for data science. 8 Workflow: projects | R for
Data Science. Retrieved May 31, 2022, from https://r4ds.had.co.nz/workflow-projects.html




This work is licensed under the Creative Commons Attribution Share Alike license. 




Lesson notes | Data classes and structures 




Intro
Learning objectives
Packages
What is a data class?
Numeric
Character
    Quotation marks
Logical
    Logical values and relational operators
Date
    Converting to the YYYY-MM-DD format with {lubridate}
Introducing vectors
Creating vectors
Manipulating vectors
    A common mistake: missing vector notation
    Shorthand functions for numerical vectors
From vectors to data frames
Tibbles
    read_csv() creates tibbles


Intro 


So far, we have focused on some of the important tools for working with data in R: the 
RStudio IDE, RStudio Projects and R Markdown. In this lesson, we will start to take a closer
look at the ways that R stores data. 


This lesson introduces many new concepts. Make sure you type along with the tutorial, so 
that you can develop a strong recall of these. Open a script in RStudio and type each code 
section out yourself. 




Learning objectives 


1. You can identify and create objects of the following classes: numeric, character, 
logical, date, and factor. 


2. You can use the class() and is.xxx() family of functions (e.g. is.numeric()) to
check data classes.

3. You can use the as.xxx() family of functions (e.g. as.character()) to convert
between data classes.

4. You know what relational operators (e.g. ==) are and can combine relational
conditions with & and |.

5. You can use the ymd()-like functions from {lubridate} to parse date data.
6. You can create vectors with the c() function. 
7. You can combine vectors into data frames. 


8. You understand the difference between a tibble and a data frame. 


Packages 


Please load the packages needed for this lesson with the code below: 


if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, lubridate)


What is a data class? 


A data class is a way of categorizing data based on the type of values that it can take. In R, 
there are five main data classes that you need to be aware of: numeric, character, logical, 
date and factor. Each is described in more detail below. 


Numeric 


Let’s start with the numeric class, which is used for data that contains, well, numbers. This 
could be an integer or a decimal, like 25 or 23.4. 


You can verify the class of numbers (or any other data type) with the built-in class()
function:

class(4)


## [1] "numeric" 


class(0.1) 


## [1] "numeric" 


The function is.numeric() is also used to verify that an object is numeric: 


is.numeric(4)


## [1] TRUE 


is.numeric("Bob") # Not numeric 


## [1] FALSE 


(Apart from is.numeric(), there is a whole family of other is.xxx() functions, such as
is.character(). You will see these below.) 
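As a quick preview of that family, the sketch below applies three of the base R is.xxx() checks to the same values. The object name `value` is just an illustrative choice.

```r
# Sketch: each is.xxx() function answers one yes/no question about class.
value <- "Bob"
is.numeric(value)    # FALSE: "Bob" is text, not a number
is.character(value)  # TRUE
is.logical(value)    # FALSE

# The same checks on a number
is.numeric(3.5)      # TRUE
is.character(3.5)    # FALSE
```

Each call returns a single logical value, a class that is described later in this lesson.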


Numeric data can sometimes be represented with scientific notation, where “e” refers to 
“10 to the power of”. For example, we can write the number 2000 as 2 times 10 to the 
power of 3, 2e3: 


2e3 


## [1] 2000 


“Integers” (numbers without any decimal) are a special class of numbers, represented 
with an “L” after the number: 


4L 


class(4L)


## [1] "integer" 


However, you usually should not have to use the L notation; to write a whole number like 
4, just write 4, not 4L. 


PRO TIP You may also see the terms “real” or “double” (abbreviated as “dbl”) used
to describe numeric data. The differences between these are only
relevant for advanced users. You can ignore them for now.
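If you are curious anyway, the base R function typeof() (not otherwise used in this lesson) reveals where the word “double” shows up. This is an optional sketch, not something you need for the rest of the lesson.

```r
# Sketch: class() vs typeof() for plain numbers and for integers.
class(4)        # "numeric"
typeof(4)       # "double": how R stores ordinary numbers internally
class(4L)       # "integer"
typeof(4L)      # "integer"

# Both kinds count as numeric, and they compare as equal
is.numeric(4L)  # TRUE
4 == 4L         # TRUE
```

In everyday analysis the two behave the same, which is why writing 4 rather than 4L is usually fine.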


Each line of code below tries to define a numeric object (the numbers 1 to 4). But all lines 
have a mistake, and do not properly define the object. Try to find the four mistakes, fix 
them and perform the assignment: 


mumerieckoon S ili ern nehe iaibinleisue’ Il 
numeric obj2 <- two # define the number 2 
memeri ekoo S) m “VS e nee benner 
mumericlob Ak m Ne rne che number 4710 
numeric Clog <= Si) Ge @lsiadiae) ehe salenmlosic S 
Character 


The character class is used for data that contains text. In R, we write text by putting it in 
quotation marks, like this: 


"A piece of text." 


We can check this object's class like so: 


class("A piece of text.") 


## [1] "character" 


is.character("A piece of text.") 


## [1] TRUE 


Note that if you wrap a number in quotes, it automatically becomes a character, 
according to R: 


class(4) # numeric 


## [1] "numeric" 


class("4") # Now a character 


## [1] "character" 


VOCAB 


Character values are sometimes referred to as “strings” or “character 
strings”. You can use these terms interchangeably. 


Quotation marks 


You can use either single, ', or double quotation marks, ", to create a character string. 
The tidyverse style guide, which we use, recommends that you use double quotes in most 
cases. 


Notably, if you start a string with a single quote, you must also close it with a single quote; 
the same goes for double quotes. So a string like "Hello World' is not properly defined 
and will cause an error, because it opens with double quotes, but closes with a single 
quote. 


If quotation marks are used to define character strings, what should you do when your 
string already has a quotation mark within it? 


For example, how would you create the following string: Obi said "Hello World"?


Here, you have two options: you can use double quotes internally, with single quotes to wrap
the whole string:


obi_string_1 <- 'Obi says "Hello World"'
obi_string_1


## [1] "Obi says \"Hello World\"" 


or you can use single quotes internally, with double quotes to wrap the whole string: 


obi_string_2 <- "Obi says 'Hello World'"
obi_string_2


## [1] "Obi says 'Hello World'" 


If you try to use double quotes both for the internal quote, and externally to wrap the 
string, you will get an error: 


obi_string_3 <- "Obi says "Hello World""


Error: object 'obi_string_3' not found


The same thing will happen if you use single quotes internally and externally. 
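There is also a third option that the printed output above hints at: the \" you saw in the output is an escaped quote, and you can type it yourself. The sketch below is a small illustration; the object names `escaped_string` and `single_wrapped` are hypothetical.

```r
# Sketch: a backslash "escapes" a quote so it does not end the string.
escaped_string <- "Obi says \"Hello World\""
single_wrapped <- 'Obi says "Hello World"'

# Both ways of typing it create exactly the same string
identical(escaped_string, single_wrapped)  # TRUE

# print() shows the escapes; writeLines() shows the text as a reader sees it
print(escaped_string)
writeLines(escaped_string)
```

The backslashes are only part of how the string is written in code; they are not stored as characters in the string itself.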


Each line of code below tries to define a character object. But all lines EXCEPT ONE have a 
mistake, and do not properly define the object. Try to find the four mistakes, fix them and 
perform the assignment: 


Goer Yolonjecicll <> Peo Woxiel 
Chammobie Cr <= Yy mene e Hga. 
Cae Oloyjeces <= Vil cin 24" yeee ollel 
Cheni Cloyicers <= Viim eyeneiialiore; IRU 
char kobjectok = romudciramsetences 


WATCH OUT Note that you cannot perform math operations on character values. For
example, the code below gives an error:

sqrt("100") # square root. Does not work

Error in sqrt("100") : non-numeric argument to mathematical function

But we can convert that character into a numeric class with the function
as.numeric(), and then the function will run:

sqrt(as.numeric("100"))


The as.numeric() function seen above is part of a family of functions for converting 
between classes. There are many others. We could for example, convert a number into a 
character: 


as.character(4)


## [1] "4"


Above, you can observe that the as.character(4) line prints its output with quotes,
indicating that it is a character. 
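A few more members of the as.xxx() family are sketched below, including what happens when a conversion is not possible. The object name `not_a_number` is only illustrative, and suppressWarnings() is used here just to keep the example quiet.

```r
# Sketch: converting between classes with the as.xxx() family.
as.numeric("3.14")   # 3.14
as.character(TRUE)   # "TRUE"
as.logical("TRUE")   # TRUE
as.integer("7")      # 7

# Text that is not a number cannot be converted:
# R returns NA (a missing value) and normally prints a warning.
not_a_number <- suppressWarnings(as.numeric("four"))
is.na(not_a_number)  # TRUE
```

An unexpected NA after a conversion is often a sign that some values in your data were not formatted the way you assumed.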




Logical 


Logical data contains two values, TRUE or FALSE, which are written in all capital letters. 
These can also be written in short as T and F. Logical data can be thought of like a light 
switch: it’s either on (TRUE) or off (FALSE). 


Logical values are often used as the arguments to functions, for example: 


mean(airquality$Ozone)


## [1] NA


mean(airquality$Ozone, na.rm = TRUE) # remove NAs to calculate mean


## [1] 42.12931


The second code line, with na.rm = TRUE, is an example of the use of the logical TRUE
value as the argument to a function.


Each line of code below tries to define a logical object. But all lines, EXCEPT ONE, have a 
mistake, and do not properly define the object. Try to find the four mistakes, fix them and 
perform the assignment. 


eplesil elogi <= ic 
OGmcalobs Zu <j Haksic 


Gales Clogs <= Ve pis! 
Gree oou <= I 
OGuica eoo <= TRUS 


Logical values and relational operators 
Logical values are returned when you apply relational operators in R. 
Relational operators (sometimes called comparison operators) test the relationship
between two values. You will consider them in detail in a future lesson, but here we'll give
a few examples:


The greater-than, >, operator checks whether the left-hand-side (LHS) object is greater 
than the right-hand-side (RHS) object: 


3 > 4 # is 3 greater than 4? Answer is FALSE 


## [1] FALSE


The == comparator checks whether two values are equal: 


3 == 3 # is 3 equal to 3? Answer is TRUE. Note the double equals sign here. 


## [1] TRUE


The <= operator checks whether the LHS is less than or equal to the RHS object: 


3 <= 3 # is 3 less than or equal to 3? Answer is TRUE 


## [1] TRUE


Logical values can be combined using the ampersand, “&”, which means “AND”, or the
vertical bar, “|”, which means “OR”.


& checks whether ALL values are true:


TRUE & TRUE # All values are true. Returns TRUE


## [1] TRUE


TRUE & FALSE # Not all values are true. Returns FALSE


## [1] FALSE


| checks whether AT LEAST ONE value is true:


TRUE | FALSE # At least one value is true. Returns TRUE


## [1] TRUE


FALSE | FALSE # No value is true. Returns FALSE


## [1] FALSE


It will become clearer how & and | work when you use these in the context of real
datasets. So if it feels unclear now, feel free to ignore it.
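As a small taste of that, the sketch below combines two conditions on a made-up set of ages. The `ages` values are hypothetical, and the example uses c() and square-bracket indexing, which this lesson only introduces (or skips) later, so treat it as a preview.

```r
# Sketch: combining two relational conditions with & and |.
ages <- c(12, 25, 40, 67)   # hypothetical ages of four people

is_adult    <- ages >= 18   # FALSE  TRUE  TRUE  TRUE
is_under_65 <- ages < 65    #  TRUE  TRUE  TRUE FALSE

is_adult & is_under_65      # FALSE  TRUE  TRUE FALSE (both must hold)
is_adult | is_under_65      #  TRUE  TRUE  TRUE  TRUE (one is enough)

# Keep only the ages where both conditions are true
working_age <- ages[is_adult & is_under_65]
working_age                 # 25 40
```

The comparison is made for each person separately, which is exactly how conditions behave when you later filter rows of a dataset.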


Predict whether each line below will evaluate to TRUE or FALSE. Then use your R code 
console to check these. 


is.numeric(5) 

5 == 

Sy > 6 

Ts nuüumenwe (Si || s) S= 6 
ro numer re (Sj) ke 3) SS 


Date 


The Date class contains dates, which must be formatted in YYYY-MM-DD format 


(e.g. "2022-12-31"), with four digits for the year, two digits for the month, and two 
digits for the day of the month. 


Of course if you just put in such a date string, R will initially consider this to be a
character:


class("2022-12-31")


## [1] "character" 


In order for R to recognize a data value as a date, you use the as.Date() function: 


my_date <- as.Date("2022-12-31")
class(my_date)


## [1] "Date" 


WATCH OUT Note the capital “D” in the as.Date() function!


With the date format, you can now do things like find the difference between two dates:


as.Date("2022-12-31") - as.Date("2022-12-20")


## Time difference of 11 days 


This would of course not be possible if you had bare characters: 


"2022-12-31" - "2022-12-20"


Error in "2022-12-31" - "2022-12-20" :
non-numeric argument to binary operator


Note that if you use any other date format than YYYY-MM-DD, R's as.Date() function will
not work by default:


as.Date("12/31/2022") # Common American date format, MM/DD/YYYY
as.Date("Dec 31, 2022") # Common American date format, MMM DD, YYYY


Error in charToDate(x) :
character string is not in a standard unambiguous format
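Base R can still read such dates if you describe their layout through the format argument of as.Date(). The sketch below is an optional alternative to the {lubridate} approach covered in the next section; the object names `us_date` and `day_first` are illustrative.

```r
# Sketch: telling as.Date() how a non-standard date string is laid out.
# %m = month number, %d = day of month, %Y = four-digit year
us_date <- as.Date("12/31/2022", format = "%m/%d/%Y")
us_date         # "2022-12-31"
class(us_date)  # "Date"

# The same day written day-first with dots
day_first <- as.Date("31.12.2022", format = "%d.%m.%Y")
us_date == day_first  # TRUE: both strings describe the same date
```

The format codes have to match the string exactly, which is the main reason many analysts prefer the more forgiving {lubridate} functions.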


Each line of code below tries to define a date object. But all lines, EXCEPT ONE, have a 
mistake, and do not properly define the object. Try to find the four mistakes, fix them and 
perform the assignment. 


cere olgil <= ag Cars (V2022 -12751 
care Cloy2 <= as Ceta l2 02A2=1l2=30) 
Cere ooo <= aSa Dere (2022-12-29) 
date obj4 <- ask Date (112728720220) 
dare o T e 


Converting to the YYYY-MM-DD format with {lubridate} 


Because R only recognizes the “YYYY-MM-DD” format as a date, you will often have to 
convert from other date formats into this format. The {lubridate} package makes this 
very easy. It has a family of intelligent date-parsing functions, which are named in terms 
of the relative arrangement of year, month and date. 


So you have the mdy() function (which stands for month, day, year), dmy() (day, month,
year), ymd() (year, month, day), and so on.


Run the lines of code below to observe how these lubridate date parsing functions 
work. 


mdy("December 31 2001")


## [1] "2001-12-31" 


mdy("Dec 31 2001") 


## [1] "2001-12-31" 


mdy("Dec 31 01")


## [1] "2001-12-31" 


mdy("01 31 2001")


## [1] "2001-01-31" 


ymd("2001 December 31st")


## [1] "2001-12-31" 


ymd("2001 DEC 31")


## [1] "2001-12-31" 


Do you see the beauty of this? You do not need to worry about whether a hyphen or a 
slash or a space was used to separate the dates, or whether the months were spelled out 
in full or abbreviated. All you need to know is the intended order of the date components 
(day, month and year) and voila! 


Note that the lubridate functions automatically do the as.Date() conversion to convert 
the values from characters to dates: 


class("December 31 2001") # recognized only as a character


## [1] "character"


class(mdy("December 31 2001")) # after applying lubridate, R recognizes value as date


## [1] "Date"


Convert the following to R dates with the ymd family of functions from lubridate: 


Marcho 2 O22. 
MAO Meien 26)" 
WAVQ2 Weve: 29W 

U2 OLAIMO S22)" 


PRO TIP R stores dates internally as the number of days since January 1, 1970. This
means that the date January 1, 1970 is represented as 0, while
January 2, 1970 is represented as 1, and so on. You can see this by
converting those dates to numbers:


as.numeric(as.Date("1970-01-01"))


## [1] 0


as.numeric(as.Date("1970-01-02"))


## [1] 1


as.numeric(as.Date("2022-12-31"))


## [1] 19357


This information is sometimes useful when importing date data. 
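For instance, if an imported date column shows up as plain day counts, you can turn the numbers back into dates by stating the origin. The following is a minimal sketch; `day_count` and `recovered_date` are hypothetical names.

```r
# Sketch: converting a day count back into a date.
day_count <- 19357
recovered_date <- as.Date(day_count, origin = "1970-01-01")
recovered_date         # "2022-12-31"
class(recovered_date)  # "Date"

# The date difference seen earlier is just a subtraction of day counts
as.numeric(as.Date("2022-12-31")) - as.numeric(as.Date("2022-12-20"))  # 11
```

Be aware that other software may count days from a different starting date, so check the source of your data before choosing an origin.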




Introducing vectors 


So far, we have been looking at data that contains just a single value. But of course, most 
data comes in some kind of collection—a data structure. The data structure you are most 
familiar with is a data table (sometimes called a spreadsheet, but typically represented as 
a data frame in R). 


But the most fundamental data structures in R are actually vectors. Let’s spend some 
time thinking about these. 


A vector is a collection of values that all have the same class (for example, all numeric or 
all character). It may be helpful to think of a vector as a column or row in an Excel 
spreadsheet. 


Creating vectors 


Vectors can be created using the c() function, with the components of the vector
separated by commas. For example, the code c(1, 2, 3) defines a vector with the
elements 1, 2 and 3.
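Because all elements of a vector must share one class, R quietly converts values when you mix classes inside c(). The sketch below shows this behaviour; the object names `mixed_vec` and `num_and_logical` are illustrative.

```r
# Sketch: mixing classes in c() forces everything into a single class.
mixed_vec <- c(1, "two", TRUE)
mixed_vec               # "1" "two" "TRUE": everything became text
class(mixed_vec)        # "character"

# Numbers mixed with logicals become numbers (TRUE is 1, FALSE is 0)
num_and_logical <- c(1, TRUE, FALSE)
num_and_logical         # 1 1 0
class(num_and_logical)  # "numeric"
```

If a vector you expected to be numeric turns out to be character, a stray piece of text among the values is a likely cause.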


In your script, define the following vectors: 


my_numeric_vec <- c(0, 1, 1, 2, 3)
my_numeric_vec # print the vector


## [1] 0 1 1 2 3


my_numeric_vec2 <- c(4, 5, 3, 4, 1)
my_numeric_vec2 # print the vector


## [1] 4 5 3 4 1


my_character_vec <- c("Bob", "Jane", "Joe", "Obi", "Aka")
my_character_vec # print the vector


## [1] "Bob" "Jane" "Joe" "Obi" "Aka"


my_logical_vec <- c(T, T, F, F, F)
my_logical_vec # print the vector


## [1] TRUE TRUE FALSE FALSE FALSE


You can also check the classes of these vectors: 


class(my_numeric_vec)


## [1] "numeric"


class(my_character_vec)


## [1] "character"


class(my_logical_vec)


## [1] "logical" 


Each line of code below tries to define a vector with three elements. But all lines, EXCEPT 
ONE, have a mistake, and do not properly define the vector object. Try to find the four 
mistakes, fix them and perform the assignment. 


myavec IL <=" (il 275 33) 

MY MVC ClrA oA 

mi, vse d <= We 8, (6) ¥ 

myi vecii <a (GLObn a Chik eNOS Oly) 

my vee 5 <=> cis Dace (e(!Z0Z20=10—10", YZO20=10=1i, Y20Z20=10=113")} )) 


The individual values within a vector are called components or elements.
So the vector c(1, 2, 3) has three components/elements.


Manipulating vectors 


Many of the functions and operations you have encountered so far in the course can be 
applied to vectors. 


For example, we can multiply our my_numeric_vec object by 2:


my_numeric_vec


## [1] 0 1 1 2 3


my_numeric_vec * 2


## [1] 0 2 2 4 6


Notice that every element in the vector was multiplied by 2. 


Or, below we take the square root of my_numeric_vec2:


my_numeric_vec2


## [1] 4 5 3 4 1


sqrt(my_numeric_vec2)


## [1] 2.000000 2.236068 1.732051 2.000000 1.000000 


You can also add (numeric) vectors to each other: 


my_numeric_vec + my_numeric_vec2 


## [1] 4 6 4 6 4 


Note that the first element of my_numeric_vec is added to the first element of 
my_numeric_vec2, the second element of my_numeric_vec is added to the second 
element of my_numeric_vec2, and so on. 
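

This element-by-element behaviour is what makes vectors convenient for health data. As a rough illustration, with made-up values and object names (weight_kg, height_m and bmi are not from the lesson), a body mass index can be computed for several people in one line:

```r
# Hypothetical measurements for three people
weight_kg <- c(60, 72, 81)
height_m  <- c(1.5, 1.8, 1.8)

# Each weight is divided by the square of the matching height
bmi <- weight_kg / height_m^2
round(bmi, 1)
```

Both vectors have the same length, so the first weight is paired with the first height, the second with the second, and so on.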


Below are some other functions you could run on vectors. Type them out in your console 
and observe their outputs: 


head(my_character_vec, 2) # first two elements 
table(my_logical_vec) 
length(my_logical_vec) 


sort(my_numeric_vec2) 
sort(my_character_vec) # sorts in alphabetical order 


Consider the vector defined here, which holds the hand circumference, in inches, of 
children: 

Cee aioclyss <—- G(4 00, Do, Sctly Setig Seg 450) 
Each line of code below tries to run a function on this vector. But all lines, EXCEPT ONE, 
have a mistake, and do not properly carry out the function. Try to find the four mistakes 
and fix them. 


sum()circum_inches # find the sum of the vector 
heceli; Cirerm imeiss) G7 (elds) (lols) iiet einas SulGimSiaes. ere Tele vacci 


sorti(cincumkinehes decreasing = agelisis))) 7? Gieleie jeloley Wis\encoie anne reas ng Order 
Table(circum_inches) # frequency table 
Janitors: rab (alsecibhil alionelasis)) ij; “saiaisyepsisialehye Cape 


## Error: <text>:1:6: unexpected symbol 
## 1: sum()circum_inches 
##         ^ 


A common mistake: missing vector notation 


An error that students frequently encounter involves failing to create a vector where one 
is needed. For example, consider the code line below: 


mean(1,2,3,4) 


## [1] 1 


Hmmm. The mean of 1, 2, 3 and 4 is definitely not 1. What is going on? This is happening 
because the primary argument to mean() must be a vector. Here, only the first value, 1, 
is treated as the data. The fix is to wrap the values in c(): 


mean(c(1,2,3,4)) 


Now all good! Watch out for this! 
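

The sketch below puts the two calls side by side; the object names wrong_mean and right_mean are ours. In the first call only the first argument is averaged, because the remaining values are matched to other parameters of mean() rather than to the data:

```r
# Only the first argument (the data) is averaged; 2, 3 and 4 are taken
# as other arguments of mean() and do not enter the calculation
wrong_mean <- mean(1, 2, 3, 4)
wrong_mean

# Wrapping the values in c() passes them together as one vector
right_mean <- mean(c(1, 2, 3, 4))
right_mean
```

The same care applies to other summary functions that expect a vector as their first argument, such as median().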


Using the median() function, find the median of this collection of numbers: 1, 5, 2, 
8, 9, 10, 11, 46, 23, 45, 2, 4, 5, 6. (Remember to put the number collection 
in a vector!) 


Shorthand functions for numerical vectors 


R has a number of shorthand functions for creating numerical vectors. The most 
commonly used of these is the colon operator, :, which creates a sequence of integers: 


1:5 


## [1] 1 2 3 4 5 


100:113 


## [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 


You can also use the seq() function to create a sequence of numbers as a vector: 


For example, to create a sequence of numbers from 1 to 10, you can run: 


seq(from = 1, to = 10) 


By default, the increment is set to 1. To use a different increment, just change the value of 
the by argument. For example, to create a sequence with an increment of two: 


seq(from = 1, to = 10, by = 2) 


## [1] 1 3 5 7 9 


It’s also possible to create descending sequences by using : or seq(): 


10:1 
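

As a small sketch of the descending case (the object names are ours): with seq(), the by argument has to be negative when from is larger than to.

```r
# Count down from 10 to 1 with the colon operator
countdown <- 10:1
countdown

# Count down in steps of 2 with seq(); `by` must be negative here
countdown_by_2 <- seq(from = 10, to = 1, by = -2)
countdown_by_2
```

Giving seq() a positive by together with a from that is larger than to produces an error rather than an empty vector.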


PRO TIP 


Although we defined a vector as a collection of values, in R even single 
values are technically considered vectors! You can check this with the 
is.vector() function: 


is.vector(c(1, 2)) # obviously a vector 


## [1] TRUE 


is.vector(1) # also a vector! 


## [1] TRUE 


From vectors to data frames 


Now that we have a handle on creating vectors, let’s move on to the most commonly used 
object in R: data frames. A data frame is just a collection of vectors of the same length 
with some helpful metadata. We can create one using the data.frame() function. 


In the below example, we first create vector variables (age, sex and comorbidity) for six 
individuals: 


age <- c(18, 25, 46, 54, 60, 72) 
sex <- c('M', 'F', 'F', 'M', 'M', 'F') 
comorbidity <- c(T, T, F, F, F, T) 


We can now use the data.frame() function to combine these into a single tabular 
structure: 


data_epi <- data.frame(age, sex, comorbidity) 
data_epi 


age sex comorbidity 
1 18 M TRUE 
2 25 F TRUE 
3 46 F FALSE 
4 54 M FALSE 
5 60 M FALSE 
6 72 F TRUE 


Note that instead of creating each vector separately, you can create your data frame 
by defining each of the vectors inside the data.frame() function. 


data_epi <- data.frame(age = c(18, 25, 46, 54, 60, 72), 
                       sex = c('M', 'F', 'F', 'M', 'M', 'F'), 
                       comorbidity = c(T, T, F, F, F, T)) 


data_epi 


##   age sex comorbidity 
## 1  18   M        TRUE 
## 2  25   F        TRUE 
## 3  46   F       FALSE 
## 4  54   M       FALSE 
## 5  60   M       FALSE 
## 6  72   F        TRUE 


We can check the class of this data frame: 


class(data_epi) 


## [1] "data.frame" 


SIDENOTE 


Most of the time you work with data in R, you will be importing it from 
external contexts. But it is sometimes useful to create datasets within R 
itself. It is in such cases that the data.frame() function will come in 
handy. 


WATCH OUT 


To create a data frame from vectors, all the vectors must have the same 
length. Otherwise you will get an error. For example: 


mydf <- data.frame(age = c(5, 9, 8), 
                   sex = c("M", "F")) 


## Error in data.frame(age = c(5, 9, 8), sex = c("M", "F")) : 
##   arguments imply differing number of rows: 3, 2 


To extract the vectors back out of the data frame, use the $ syntax. Run the following 
lines of code in your console to observe this. 


data_epi$age 
is.vector(data_epi$age) # verify that this column is indeed a vector 
class(data_epi$age) # check the class of the vector 
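

To see why pulling a column out with $ is useful, here is a minimal sketch on an invented data frame (clinic_df and its values are hypothetical, not course data). Once a column is extracted, any vector function can be applied to it, and the same $ syntax can also be used to add a new column:

```r
# A small hypothetical data frame (values invented for illustration)
clinic_df <- data.frame(
  patient = c("A", "B", "C"),
  temp_c  = c(36.8, 38.5, 37.2)
)

# Pull one column out as a vector and summarise it
temps <- clinic_df$temp_c
mean(temps)

# Use $ on the left-hand side to create a new column (Fahrenheit)
clinic_df$temp_f <- clinic_df$temp_c * 9 / 5 + 32
clinic_df
```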


Earlier, we defined the following vectors. Combine these into a 
data frame, with the following column names: “number_of_children” for the numeric 
vector, “name” for the character vector and “is_married” for the logical vector. 


my_numeric_vec <- c(0, 1, 1, 2, 3) 
my_character_vec <- c("Bob", "Jane", "Joe", "Obi", "Aka") 
my_logical_vec <- c(TRUE, TRUE, FALSE, FALSE, FALSE) 


Use the data.frame() function to define a data frame in R that resembles the following 
table: 
table: 


room number_of_windows 


dining 3 
kitchen 2 
bedroom 5 


Tibbles 


The default version of tabular data in R is called a data frame, but there is another 
representation of tabular data provided by the tidyverse package. It’s called a tibble, 
and it is an improved version of the data.frame. 


You can create a tibble using the tibble() function. (Remember to load the tidyverse 
package to use its functions.) 


tibble_epi <- tibble( 
  age = c(18, 25, 46, 54, 60, 72), 
  sex = c('M', 'F', 'F', 'M', 'M', 'F'), 
  comorbidity = c(T, T, F, F, F, T) 
) 

tibble_epi 


# A tibble: 6 x 3 
age sex comorbidity 
<dbl> <chr> <lgl> 
1 18 M TRUE 
2 25 F TRUE 
3 46 F FALSE 
4 54 M FALSE 
5 60 M FALSE 
6 72 F TRUE 


Notice that the tibble gives the data dimensions in the first line: 


# A tibble: 6 x 3 
age sex comorbidity 
<dbl> <chr> <lgl> 
1 18 M TRUE 
2 25 F TRUE 


And also tells you the data types, at the top of each column: 


# A tibble: 6 x 3 
age sex comorbidity 
<dbl> <chr> <lgl> 
1 18 M TRUE 
2 25 F TRUE 


There, “dbl” stands for double (which is a kind of numeric class), “chr” stands for 
character, and “lgl” for logical. 


You can convert a data.frame to a tibble using the as_tibble() function: 


df <- data.frame( 
  age = c(18, 25, 46, 54, 60, 72), 
  sex = c('M', 'F', 'F', 'M', 'M', 'F'), 
  comorbidity = c(T, T, F, F, F, T) 
) 


a_tibble <- as_tibble(df) 


a_tibble 


# A tibble: 6 x 3 
age sex comorbidity 
<dbl> <chr> <lgl> 
1 18 M TRUE 
2 25 F TRUE 
3 46 F FALSE 
4 54 M FALSE 
5 60 M FALSE 
6 72 F TRUE 


And you can convert a tibble back to a data frame with as.data.frame(): 


as.data.frame(a_tibble) 


age sex comorbidity 
1 18 M TRUE 
2 25 F TRUE 
3 46 F FALSE 
4 54 M FALSE 
5 60 M FALSE 
6 72 F TRUE 


For most of your data analysis needs, you should prefer tibbles over regular data 
frames. 
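

One concrete difference is sketched below, assuming the {tibble} package (part of the tidyverse) is installed. A regular data frame quietly accepts an abbreviated column name after $, whereas a tibble refuses to guess. The objects small_df and small_tib are invented for this illustration.

```r
library(tibble)

small_df  <- data.frame(age = c(18, 25), sex = c("M", "F"))
small_tib <- as_tibble(small_df)

# Data frame: `ag` is silently completed to `age`
small_df$ag

# Tibble: no partial matching, so this gives NULL (with a warning)
small_tib$ag
```

The stricter behaviour means a mistyped column name is flagged early instead of silently returning a different column.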


read_csv() creates tibbles 


When you import data with the read_csv() function from {readr}, you get a tibble: 


ebola_tib <- read_csv("https://tinyurl.com/ebola-data-sample") # Needs internet to run 
class(ebola_tib) 


## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 


But when you import data with the base read.csv() function, you get a data.frame: 


ebola_df <- read.csv("https://tinyurl.com/ebola-data-sample") # Needs internet to run 
class(ebola_df) 


## [1] "data.frame" 


Try printing ebola_tib and ebola_df to your console to observe the different printing 
behavior of tibbles and data frames. 


The iris data frame is one of R's built-in datasets. Convert it to a tibble with the 
as_tibble() function. Then print it to your console. How does the tibble output differ 
from the original iris data frame? 


Factors 


Finally, let’s turn briefly to factors, another data class. We left this class until the 
end because understanding factors requires that you understand vectors. 


A factor is a nominal (categorical) variable with a set of known possible values called 
levels. 


Why might we use a factor class? The most common reasons are: 


• to force characters to sort in a custom order 
• to show zero counts 


What these mean will become clear by considering an example. 


Imagine that you have a variable that records the month of birth for a number of infants: 


birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr") 


And you want to count the number of births per month. You could use the base table() 
function: 


table(birth_month) 


## birth_month 
## Apr Dec Jan Mar Nov Oct 
##   2   1   2   1   1   1 


We see that 2 babies were born in April, 1 in December, and so on. 


You could also use the tabyl() function from {janitor} for this: 


tabyl(birth_month) 


 birth_month n percent 
         Apr 2   0.250 
         Dec 1   0.125 
         Jan 2   0.250 
         Mar 1   0.125 
         Nov 1   0.125 
         Oct 1   0.125 


But do you see a problem with these outputs? The months are sorted in alphabetical 
order(!) Indeed, if you try to sort() the birth_month vector directly, you will note the 
same thing: 


sort(birth_month) 


## [1] "Apr" "Apr" "Dec" "Jan" "Jan" "Mar" "Nov" "Oct" 


But this is not a sensible way to sort this variable. A chronological order, with January first, 
would be much better. 


You can fix this problem with a factor. To create a factor you use the factor() function, 
with your original character vector and a list of valid levels, arranged in the correct order: 


birth_month_factor <- factor(x = birth_month, 
                             levels = c("Jan", "Feb", "Mar", "Apr", 
                                        "May", "Jun", "Jul", "Aug", 
                                        "Sep", "Oct", "Nov", "Dec")) 


class(birth_month_factor) # check its class 


## [1] "factor" 


birth_month_factor # print it 


## [1] Dec Apr Jan Mar Oct Nov Jan Apr 
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 


Notice that the levels are listed in the output. 


[1] Dec Apr Jan Mar Oct Nov Jan Apr 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 


Now, we can sort the vector properly: 


sort(birth_month_factor) 


## [1] Jan Jan Mar Apr Apr Oct Nov Dec 
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 


And if we create the frequency count tables, we get the right order: 


table(birth_month_factor) 


## birth_month_factor 
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##   2   0   1   2   0   0   0   0   0   1   1   1 


tabyl(birth_month_factor) 


 birth_month_factor n percent 
                Jan 2   0.250 
                Feb 0   0.000 
                Mar 1   0.125 
                Apr 2   0.250 
                May 0   0.000 
                Jun 0   0.000 
                Jul 0   0.000 
                Aug 0   0.000 
                Sep 0   0.000 
                Oct 1   0.125 
                Nov 1   0.125 
                Dec 1   0.125 


As you can see, months with zero counts are also included in the table outputs. This will 
often be useful! 


If you would rather not see these zero count months, the tabyl() function allows you to 
drop them, by using the show_missing_levels argument: 


tabyl(birth_month_factor, show_missing_levels = FALSE) 


 birth_month_factor n percent 
                Jan 2   0.250 
                Mar 1   0.125 
                Apr 2   0.250 
                Oct 1   0.125 
                Nov 1   0.125 
                Dec 1   0.125 
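

For readers working without {janitor}, a rough base R counterpart is sketched below. It rebuilds the factor from this lesson using month.abb (a constant built into R that holds the twelve abbreviated month names) and then applies droplevels(), a base R function that removes levels with no observations. Treating this as an equivalent of show_missing_levels = FALSE is our framing, not the lesson's.

```r
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")
birth_month_factor <- factor(birth_month, levels = month.abb)

# All 12 levels appear, including months with zero births
full_counts <- table(birth_month_factor)
full_counts

# droplevels() discards unused levels but keeps the chronological order
observed_counts <- table(droplevels(birth_month_factor))
observed_counts
```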


The variable visit_day below records the day of the week that a clinic was visited. 


visit day Z= e("Mon", WiMiioial SERVIE Ee ELA A "Wed", wate Wee Te) 


Convert this into a factor with the factor() function. The levels should be in order of the 
days of the week, starting with “Sun” and ending with “Sat”. 


Then create a frequency table of this variable using the tabyl() function. Does the table 
sort in the proper chronological order? 


Wrap-up 


You've learned a lot in this lesson! You now know about all of the basic R data classes 
(numeric, character, logical, date, factor) and how to create objects of each class. You 
also know how to check an object's class with class() and convert between classes with 
as.xxx(). Finally, you know how to create vectors and data frames. 


With this knowledge, you are now ready to start doing some serious data analysis in R. In 
the coming lessons, you'll start to learn about the dplyr package, which will provide you 
with powerful tools for manipulating your data frames and tibbles. 


Congratulations on making it this far! You have covered a lot and should be proud of 
yourself. 


Contributors 


The following team members contributed to this lesson: 


DANIEL CAMARA 


Data Scientist at the GRAPH Network and fellow as a public health 
researcher at Fiocruz, Brazil 
Passionate about lots of things, especially when it involves people leading 
lives with more equality and freedom 


EDUARDO ARAUJO 


Student at Universidade Tecnologica Federal do Parana 
Passionate about reproducible science and education 


LAURE VANCAUWENBERGHE 


Data analyst, the GRAPH Network 
A firm believer in science for good, striving to ally programming, health 
and education 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


References 


Some material in this lesson was adapted from the following sources: 


• Wickham, H., & Grolemund, G. (n.d.). R for data science. 15 Factors | R for Data 
Science. Accessed October 26, 2022. https://r4ds.had.co.nz/factors.html. 


This work is licensed under the Creative Commons Attribution Share Alike license. 


Lesson notes | Selecting and renaming 
columns 


Created by the GRAPH Courses team 


January 2023 




Introduction 
Learning objectives 
The Yaounde COVID-19 dataset 
Introducing select() 
Selecting column ranges with : 
Excluding columns with ! 
Helper functions for select() 
starts_with() and ends_with() 
contains() 
everything() 
Change column names with rename() 
Rename within select() 
Wrap up! 


Introduction 


Today we will begin our exploration of the {dplyr} package! Our first verb on the list is 
select(), which lets you keep or drop variables from your data frame. Choosing your 
variables is the first step in cleaning your data. 


Fig: the select() function. 


Let’s go! 


Learning objectives 


• You can keep or drop columns from a data frame using the dplyr::select() 
function from the {dplyr} package. 


• You can select a range or combination of columns using operators like the colon (:), 
the exclamation mark (!), and the c() function. 


• You can select columns based on patterns in their names with helper functions like 
starts_with(), ends_with(), contains(), and everything(). 


• You can use rename() and select() to change column names. 


The Yaounde COVID-19 dataset 


In this lesson, we analyse results from a COVID-19 serological survey conducted in 
Yaounde, Cameroon in late 2020. The survey estimated how many people had been 
infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full 
dataset can be obtained from Zenodo, and the paper can be viewed here. 


Spend some time browsing through this dataset. Each line corresponds to one patient 
surveyed. There are some demographic, socio-economic and COVID-related variables. The 
results of the IgG and IgM antibody tests are in the columns igg_result and igm_result. 


yaounde <- read_csv(here::here("data/yaounde_data.csv")) 


yaounde 


## # A tibble: 5 x 53 
##   id                   date_surveyed   age age_category 
##   <chr>                <date>        <dbl> <chr> 
## 1 BRIQUETERIE_000_0001 2020-10-22       45 45 - 64 
## 2 BRIQUETERIE_000_0002 2020-10-24       55 45 - 64 
## 3 BRIQUETERIE_000_0003 2020-10-24       23 15 - 29 
## 4 BRIQUETERIE_002_0001 2020-10-22       20 15 - 29 
## 5 BRIQUETERIE_002_0002 2020-10-22       55 45 - 64 
## # … with 49 more variables: age_category_3 <chr>, 
## #   sex <chr>, highest_education <chr>, occupation <chr>, … 


Left: the Yaounde survey team. Right: an antibody test being administered. 


Introducing select() 


dplyr::select(B,C,E) 


Fig: the select() function. (Drawing adapted from Allison Horst). 


dplyr::select() lets us pick which columns (variables) to keep or drop. 


We can select a column by name: 


yaounde %>% select(age) 


## # A tibble: 5 x 1 
##     age 
##   <dbl> 
## 1    45 
## 2    55 
## 3    23 
## 4    20 
## 5    55 


Or we can select a column by position: 


yaounde %>% select(3) # `age` is the 3rd column 


## # A tibble: 5 x 1 
##     age 
##   <dbl> 
## 1    45 
## 2    55 
## 3    23 
## 4    20 
## 5    55 


To select multiple variables, we separate them with commas: 


yaounde %>% select(age, sex, igg_result) 


## # A tibble: 971 x 3 
##      age sex    igg_result 
##    <dbl> <chr>  <chr> 
##  1    45 Female Negative 
##  2    55 Male   Positive 
##  3    23 Male   Negative 
##  4    20 Female Positive 
##  5    55 Female Positive 
##  6    17 Female Negative 
##  7    13 Female Positive 
##  8    28 Male   Negative 
##  9    30 Male   Negative 
## 10    13 Female Positive 
## # … with 961 more rows 
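

If you do not have the survey file at hand, the same ideas can be tried on a small stand-in. The sketch below assumes {dplyr} is installed; toy_survey and its values are invented for illustration and are not the real Yaounde data.

```r
library(dplyr)

# Hypothetical stand-in for the survey data (values invented)
toy_survey <- tibble(
  id         = c("P1", "P2", "P3"),
  age        = c(45, 55, 23),
  sex        = c("Female", "Male", "Male"),
  igg_result = c("Negative", "Positive", "Negative")
)

# Selecting by name and by position returns the same column
by_name     <- toy_survey %>% select(age)
by_position <- toy_survey %>% select(2)
identical(by_name, by_position)

# Columns come back in the order you list them
reordered <- toy_survey %>% select(igg_result, age)
names(reordered)
```

Selecting by name is usually the safer habit, since a column's position can shift whenever the dataset changes.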
PRACTICE (in RMD) 


• Select the weight and height variables in the yaounde data frame. 


• Select the 16th and 22nd columns in the yaounde data frame. 


For the next part of the tutorial, let’s create a smaller subset of the data, called yao. 


yao <- 
  yaounde %>% 
  select(age, 
         sex, 
         highest_education, 
         occupation, 
         is_smoker, 
         is_pregnant, 
         igg_result, 
         igm_result) 


yao 


## # A tibble: 5 x 8 
##     age sex    highest_education occupation      is_smoker 
##   <dbl> <chr>  <chr>             <chr>           <chr> 
## 1    45 Female Secondary         Informal worker Non-smoker 
## 2    55 Male   University        Salaried worker Ex-smoker 
## 3    23 Male   University        Student         Smoker 
## 4    20 Female Secondary         Student         Non-smoker 
## 5    55 Female Primary           Trader--Farmer  Non-smoker 
## # … with 3 more variables: is_pregnant <chr>, 
## #   igg_result <chr>, igm_result <chr> 


Selecting column ranges with : 


The : operator selects a range of consecutive variables: 


yao %>% select(age:occupation) # Select all columns from `age` to `occupation` 


## # A tibble: 5 x 4 
##     age sex    highest_education occupation 
##   <dbl> <chr>  <chr>             <chr> 
## 1    45 Female Secondary         Informal worker 
## 2    55 Male   University        Salaried worker 
## 3    23 Male   University        Student 
## 4    20 Female Secondary         Student 
## 5    55 Female Primary           Trader--Farmer 


We can also specify a range with column numbers: 


yao %>% select(1:4) # Select columns 1 to 4 


## # A tibble: 5 x 4 
##     age sex    highest_education occupation 
##   <dbl> <chr>  <chr>             <chr> 
## 1    45 Female Secondary         Informal worker 
## 2    55 Male   University        Salaried worker 
## 3    23 Male   University        Student 
## 4    20 Female Secondary         Student 
## 5    55 Female Primary           Trader--Farmer 
PRACTICE (in RMD) 


• With the yaounde data frame, select the columns between 
symptoms and sequelae, inclusive. (“Inclusive” means you should 
also include symptoms and sequelae in the selection.) 


Excluding columns with ! 


The exclamation point negates a selection: 


yao %>% select(!age) # Select all columns except `age` 


## # A tibble: 5 x 7 
##   sex    highest_education occupation      is_smoker 
##   <chr>  <chr>             <chr>           <chr> 
## 1 Female Secondary         Informal worker Non-smoker 
## 2 Male   University        Salaried worker Ex-smoker 
## 3 Male   University        Student         Smoker 
## 4 Female Secondary         Student         Non-smoker 
## 5 Female Primary           Trader--Farmer  Non-smoker 
## # … with 3 more variables: is_pregnant <chr>, 
## #   igg_result <chr>, igm_result <chr> 


To drop a range of consecutive columns, we use, for example, !age:occupation: 


yao %>% select(!age:occupation) # Drop columns from `age` to `occupation` 


## # A tibble: 5 x 4 
##   is_smoker  is_pregnant igg_result igm_result 
##   <chr>      <chr>       <chr>      <chr> 
## 1 Non-smoker No          Negative   Negative 
## 2 Ex-smoker  <NA>        Positive   Negative 
## 3 Smoker     <NA>        Negative   Negative 
## 4 Non-smoker No          Positive   Negative 
## 5 Non-smoker No          Positive   Negative 


To drop several non-consecutive columns, place them inside !c(): 


yao %>% select(!c(age, sex, igg_result)) 


## # A tibble: 5 x 5 
##   highest_education occupation      is_smoker  is_pregnant 
##   <chr>             <chr>           <chr>      <chr> 
## 1 Secondary         Informal worker Non-smoker No 
## 2 University        Salaried worker Ex-smoker  <NA> 
## 3 University        Student         Smoker     <NA> 
## 4 Secondary         Student         Non-smoker No 
## 5 Primary           Trader--Farmer  Non-smoker No 
## # … with 1 more variable: igm_result <chr> 
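

A quick way to convince yourself of how ranges and negation fit together is to try them on a tiny table. The sketch below assumes {dplyr} is installed; toy_yao is an invented stand-in with only five columns.

```r
library(dplyr)

toy_yao <- tibble(
  age        = c(45, 55),
  sex        = c("Female", "Male"),
  occupation = c("Student", "Trader"),
  is_smoker  = c("Non-smoker", "Smoker"),
  igg_result = c("Negative", "Positive")
)

# Keeping a range and dropping the same range split the columns in two
kept    <- toy_yao %>% select(age:occupation)
dropped <- toy_yao %>% select(!age:occupation)
names(kept)
names(dropped)

# Drop two columns that are not next to each other
scattered <- toy_yao %>% select(!c(sex, igg_result))
names(scattered)
```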
PRACTICE (in RMD) 


• From the yaounde data frame, remove all columns between 
highest_education and consultation, inclusive. 


Helper functions for select() 


dplyr has a number of helper functions to make selecting easier by using patterns from 
the column names. Let’s take a look at some of these. 


starts_with() and ends_with() 


These two helpers work exactly as their names suggest! 


yao %>% select(starts_with("is_")) # Columns that start with "is_" 


## # A tibble: 5 x 2 
##   is_smoker  is_pregnant 
##   <chr>      <chr> 
## 1 Non-smoker No 
## 2 Ex-smoker  <NA> 
## 3 Smoker     <NA> 
## 4 Non-smoker No 
## 5 Non-smoker No 


yao %>% select(ends_with("_result")) # Columns that end with "_result" 


## # A tibble: 5 x 2 
##   igg_result igm_result 
##   <chr>      <chr> 
## 1 Negative   Negative 
## 2 Positive   Negative 
## 3 Negative   Negative 
## 4 Positive   Negative 
## 5 Positive   Negative 


contains() 


contains() helps select columns that contain a certain string: 


yaounde %>% select(contains("drug")) # Columns that contain the string "drug" 


## # A tibble: 5 x 12 
##   drugsource       is_drug_parac is_drug_antibio 
##   <chr>                    <dbl>           <dbl> 
## 1 Self or familial             1               0 
## 2 <NA>                        NA              NA 
## 3 <NA>                        NA              NA 
## 4 Self or familial             0               1 
## 5 <NA>                        NA              NA 
## # … with 9 more variables: is_drug_hydrocortisone <dbl>, 
## #   is_drug_other_anti_inflam <dbl>, … 
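

The three helpers can be compared side by side on a small invented table. In the sketch below, toy_cols is hypothetical and only mimics a few of the column names used in this lesson; it assumes {dplyr} is installed.

```r
library(dplyr)

toy_cols <- tibble(
  is_smoker     = c("Smoker", "Non-smoker"),
  is_pregnant   = c("No", NA),
  is_drug_parac = c(1, 0),
  drugsource    = c("Self or familial", NA),
  igg_result    = c("Negative", "Positive"),
  igm_result    = c("Negative", "Negative")
)

# Prefix match
is_cols <- toy_cols %>% select(starts_with("is_")) %>% names()
is_cols

# Suffix match
result_cols <- toy_cols %>% select(ends_with("_result")) %>% names()
result_cols

# Match anywhere in the name
drug_cols <- toy_cols %>% select(contains("drug")) %>% names()
drug_cols
```

Note that is_drug_parac is picked up by both starts_with("is_") and contains("drug"): each helper looks only at its own pattern. As far as we know, these helpers ignore upper and lower case by default.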


everything () 


Another helper function, everything(), matches all variables that have not yet been 
selected. 


# First, `is_pregnant`, then every other column. 
yao %>% select(is_pregnant, everything()) 


## # A tibble: 5 x 8 
##   is_pregnant   age sex    highest_education occupation 
##   <chr>       <dbl> <chr>  <chr>             <chr> 
## 1 No             45 Female Secondary         Informal worker 
## 2 <NA>           55 Male   University        Salaried worker 
## 3 <NA>           23 Male   University        Student 
## 4 No             20 Female Secondary         Student 
## 5 No             55 Female Primary           Trader--Farmer 
## # … with 3 more variables: is_smoker <chr>, 
## #   igg_result <chr>, igm_result <chr> 


It is often useful for establishing the order of columns. 


Say we wanted to bring the is_pregnant column to the start of the yao data frame. We 
could type out all the column names manually: 


yao %>% select(is_pregnant, 
               age, 
               sex, 
               highest_education, 
               occupation, 
               is_smoker, 
               igg_result, 
               igm_result) 


## # A tibble: 5 x 8 
##   is_pregnant   age sex    highest_education occupation 
##   <chr>       <dbl> <chr>  <chr>             <chr> 
## 1 No             45 Female Secondary         Informal worker 
## 2 <NA>           55 Male   University        Salaried worker 
## 3 <NA>           23 Male   University        Student 
## 4 No             20 Female Secondary         Student 
## 5 No             55 Female Primary           Trader--Farmer 
## # … with 3 more variables: is_smoker <chr>, 
## #   igg_result <chr>, igm_result <chr> 


But this would be painful for larger data frames, such as our original yaounde data frame. 
In such a case, we can use everything(): 


# Bring `is_pregnant` to the front of the data frame 
yaounde %>% select(is_pregnant, everything()) 


## # A tibble: 5 x 53 
##   is_pregnant id                   date_surveyed   age 
##   <chr>       <chr>                <date>        <dbl> 
## 1 No          BRIQUETERIE_000_0001 2020-10-22       45 
## 2 <NA>        BRIQUETERIE_000_0002 2020-10-24       55 
## 3 <NA>        BRIQUETERIE_000_0003 2020-10-24       23 
## 4 No          BRIQUETERIE_002_0001 2020-10-22       20 
## 5 No          BRIQUETERIE_002_0002 2020-10-22       55 
## # … with 49 more variables: age_category <chr>, 
## #   age_category_3 <chr>, sex <chr>, … 


This helper can be combined with many others. 


# Bring columns that end with "result" to the front of the data frame 
yaounde %>% select(ends_with("result"), everything()) 


## # A tibble: 5 x 53 
##   igg_result igm_result id               date_surveyed   age 
##   <chr>      <chr>      <chr>            <date>        <dbl> 
## 1 Negative   Negative   BRIQUETERIE_000… 2020-10-22       45 
## 2 Positive   Negative   BRIQUETERIE_000… 2020-10-24       55 
## 3 Negative   Negative   BRIQUETERIE_000… 2020-10-24       23 
## 4 Positive   Negative   BRIQUETERIE_002… 2020-10-22       20 
## 5 Positive   Negative   BRIQUETERIE_002… 2020-10-22       55 
## # … with 48 more variables: age_category <chr>, 
## #   age_category_3 <chr>, sex <chr>, … 
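

As an aside that goes slightly beyond this lesson: recent versions of {dplyr} (1.0.0 and later) also offer relocate(), which moves columns to the front by default. The sketch below, on an invented table called toy_order, suggests that the two approaches give the same column order.

```r
library(dplyr)

toy_order <- tibble(
  id          = c("P1", "P2"),
  age         = c(45, 55),
  is_pregnant = c("No", NA),
  igg_result  = c("Negative", "Positive")
)

# everything() fills in all columns not already named
with_everything <- toy_order %>% select(is_pregnant, everything())

# relocate() moves the named column to the front and keeps the rest
with_relocate <- toy_order %>% relocate(is_pregnant)

names(with_everything)
identical(names(with_everything), names(with_relocate))
```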


PRACTICE (in RMD) 


• Select all columns in the yaounde data frame that start with “is_”. 


• Move the columns that start with “is_” to the beginning of the 
yaounde data frame. 


Change column names with rename() 


Fig: the rename() function, illustrated with dplyr::rename(enemies = species). (Drawing adapted from Allison Horst) 


dplyr::rename() is used to change column names: 


# Rename `age` and `sex` to `patient_age` and `patient_sex` 
yaounde %>% 
  rename(patient_age = age, 
         patient_sex = sex) 


## # A tibble: 5 x 53 
##   id                  date_surveyed patient_age age_category 
##   <chr>               <date>              <dbl> <chr> 
## 1 BRIQUETERIE_000_00… 2020-10-22             45 45 - 64 
## 2 BRIQUETERIE_000_00… 2020-10-24             55 45 - 64 
## 3 BRIQUETERIE_000_00… 2020-10-24             23 15 - 29 
## 4 BRIQUETERIE_002_00… 2020-10-22             20 15 - 29 
## 5 BRIQUETERIE_002_00… 2020-10-22             55 45 - 64 
## # … with 49 more variables: age_category_3 <chr>, 
## #   patient_sex <chr>, highest_education <chr>, … 


WATCH OUT 


The fact that the new name comes first in the function 
(rename(NEWNAME = OLDNAME)) is sometimes confusing. You should get 
used to this with time. 
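

To make the NEWNAME = OLDNAME order concrete, here is a small sketch on an invented table (toy_names is hypothetical; {dplyr} is assumed to be installed). The second call deliberately gets the order backwards. It is wrapped in tryCatch(), a base R function not covered in this course, so that the error is captured instead of stopping the script.

```r
library(dplyr)

toy_names <- tibble(age = c(45, 55), sex = c("Female", "Male"))

# Correct: new name on the left, existing name on the right
renamed <- toy_names %>% rename(patient_age = age)
names(renamed)

# Backwards: `patient_age` does not exist yet, so rename() fails
reversed <- tryCatch(
  toy_names %>% rename(age = patient_age),
  error = function(e) "rename() failed: no column called `patient_age`"
)
reversed
```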


Rename within select() 


You can also rename columns while selecting them: 


# Select `age` and `sex`, and rename them to `patient_age` and `patient_sex` 
yaounde %>% 
  select(patient_age = age, 
         patient_sex = sex) 


## # A tibble: 5 x 2 
##   patient_age patient_sex 
##         <dbl> <chr> 
## 1          45 Female 
## 2          55 Male 
## 3          23 Male 
## 4          20 Female 
## 5          55 Female 
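

The practical difference between the two approaches is which columns survive. The sketch below (toy_both is an invented table; {dplyr} is assumed to be installed) compares the number of columns left after each call:

```r
library(dplyr)

toy_both <- tibble(
  id  = c("P1", "P2"),
  age = c(45, 55),
  sex = c("Female", "Male")
)

# rename() keeps every column and only changes the name
after_rename <- toy_both %>% rename(patient_age = age)
ncol(after_rename)

# select() keeps only the columns you mention
after_select <- toy_both %>% select(patient_age = age)
ncol(after_select)
```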

Wrap Up! 


I hope this first lesson has allowed you to see how intuitive and useful the {dplyr} verbs 
are! This is the first of a series of basic data wrangling verbs: see you in the next lesson to 
learn more. 


Fig: Basic Data Wrangling Dplyr Verbs. 


Contributors 


The following team members contributed to this lesson: 


LAURE VANCAUWENBERGHE 


Data analyst, the GRAPH Network 
A firm believer in science for good, striving to ally programming, health 
and education 


ANDREE VALLE CAMPOS 


R Developer and Instructor, the GRAPH Network 
Motivated by reproducible science and education 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


References 


Some material in this lesson was adapted from the following sources: 


• Horst, A. (2021). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original 
work published 2020) 


• Subset columns using their names and types - Select. (n.d.). Retrieved 31 December 
2021, from https://dplyr.tidyverse.org/reference/select.html 


Artwork was adapted from: 


• Horst, A. (2021). R & stats illustrations by Allison Horst. 
https://github.com/allisonhorst/stats-illustrations (Original work published 2018) 


Lesson notes | Filtering rows 


Created by the GRAPH Courses team 


January 2023 




Intro 
Learning objectives 
The Yaounde COVID-19 dataset 
Introducing filter() 
Relational operators 
Combining conditions with & and | 
Negating conditions with ! 
NA values 
Wrap up! 


Intro 


We continue with the {dplyr} package and meet the filter() verb. Last time we saw how to 
select variables (columns); today we will see how to keep or drop data entries (rows) 
using filter(). Dropping abnormal data entries or keeping subsets of your data points is 
another essential aspect of data wrangling. 


Let's go! 


Learning objectives 


1. You can use dplyr::filter() to keep or drop rows from a data frame. 


2. You can filter rows by specifying conditions on numbers or strings using relational 
operators like greater than (>), less than (<), equal to (==), and not equal to (!=). 


3. You can filter rows by combining conditions using logical operators like the 
ampersand (&) and the vertical bar (|). 


4. You can filter rows by negating conditions using the exclamation mark (!) logical 
operator. 


5. You can filter rows with missing values using the is.na() function. 


The Yaounde COVID-19 dataset 


In this lesson, we will again use the data from the COVID-19 serological survey conducted 
in Yaounde, Cameroon. 


yaounde <- read_csv(here::here('data/yaounde_data.csv')) 
## a smaller subset of variables 
yao <- yaounde %>% 
  select(age, sex, weight_kg, highest_education, neighborhood, 
         occupation, is_smoker, is_pregnant, 
         igg_result, igm_result) 
yao 


## # A tibble: 5 x 10 
##     age sex    weight_kg highest_education neighborhood 
##   <dbl> <chr>      <dbl> <chr>             <chr> 
## 1    45 Female        95 Secondary         Briqueterie 
## 2    55 Male          96 University        Briqueterie 
## 3    23 Male          74 University        Briqueterie 
## 4    20 Female        70 Secondary         Briqueterie 
## 5    55 Female        67 Primary           Briqueterie 
## # … with 5 more variables: occupation <chr>, 
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>, … 


Introducing filter () 


We use filter() to keep rows that satisfy a set of conditions. Let’s take a look ata 
simple example. If we want to keep just the male records, we run: 


Q 


yao %>% filter(sex == "Male") 


## # A tibble: 5 x 10
##     age sex   weight_kg highest_education neighborhood
##   <dbl> <chr>     <dbl> <chr>             <chr>
## 1    55 Male         96 University        Briqueterie
## 2    23 Male         74 University        Briqueterie
## 3    28 Male         62 Doctorate         Briqueterie
## 4    30 Male         73 Secondary         Briqueterie
## 5    42 Male         71 Secondary         Briqueterie
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>


Note the use of the double equal sign == rather than the single equal sign =. The == sign
tests for equality, as demonstrated below:


## create the object `sex_vector` with three elements
sex_vector <- c("Male", "Female", "Female")
## test which elements are equal to "Male"
sex_vector == "Male"


## [1]  TRUE FALSE FALSE


So the code yao %>% filter(sex == "Male") will keep all rows where the equality test
sex == "Male" evaluates to TRUE.
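

To make the link between the equality test and the rows that are kept more concrete, here is a minimal sketch. The data frame `toy` and its values are made up for illustration and are not part of the survey data.

```r
library(dplyr)

# hypothetical toy data, invented for illustration
toy <- data.frame(id = 1:4,
                  sex = c("Male", "Female", "Female", "Male"))

# the equality test returns one TRUE/FALSE value per row
is_male <- toy$sex == "Male"
is_male

# filter() keeps exactly the rows where the test is TRUE
males <- toy %>% filter(sex == "Male")
males
```

Rows 1 and 4 are the ones for which the test is TRUE, so these are the two rows that filter() keeps.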


It is often useful to chain filter() with nrow() to get the number of rows fulfilling a
condition.


## how many respondents were male?
yao %>%
  filter(sex == "Male") %>%
  nrow()


KEY POINT

The double equal sign, ==, tests for equality, while the single equals sign,
=, is used for specifying values to arguments inside functions.


PRACTICE (in RMD)

Filter the yao data frame to respondents who were pregnant during the
survey.

How many respondents were female? (Use filter() and nrow().)


Relational operators


The == operator introduced above is an example of a “relational” operator, as it tests the
relation between two values. Here is a list of some of these operators:


Operator    is TRUE if
A < B       A is less than B
A <= B      A is less than or equal to B
A > B       A is greater than B
A >= B      A is greater than or equal to B
A == B      A is equal to B
A != B      A is not equal to B
A %in% B    A is an element of B


Fig: AND and OR operators visualized. Panels: !A (not A), A & B (A and B), A | B (A or B), and the negation of a combined condition.


Let’s see how to use these within filter():


yao %>% filter(sex != "Male") ## keep rows where `sex` is not "Male"


## # A tibble: 5 x 10
##     age sex    weight_kg highest_education neighborhood
##   <dbl> <chr>      <dbl> <chr>             <chr>
## 1    45 Female        95 Secondary         Briqueterie
## 2    20 Female        70 Secondary         Briqueterie
## 3    55 Female        67 Primary           Briqueterie
## 4    17 Female        65 Secondary         Briqueterie
## 5    13 Female        65 Secondary         Briqueterie
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>


yao %>% filter(age < 6) ## keep respondents under 6


## # A tibble: 5 x 10
##     age sex    weight_kg highest_education neighborhood
##   <dbl> <chr>      <dbl> <chr>             <chr>
## 1     5 Female        19 Primary           Carriere
## 2     5 Female        26 Primary           Carriere
## 3     5 Male          16 Primary           Cité Verte
## 4     5 Female        21 Primary           Ekoudou
## 5     5 Male          15 Primary           Ekoudou
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>

yao %>% filter(age >= 70) ## keep respondents aged at least 70


## # A tibble: 5 x 10
##     age sex    weight_kg highest_education neighborhood
##   <dbl> <chr>      <dbl> <chr>             <chr>
## 1    78 Male          95 Secondary         Briqueterie
## 2    79 Female        40 Primary           Briqueterie
## 3    78 Female        60 Primary           Briqueterie
## 4    75 Male          74 Primary           Briqueterie
## 5    72 Male          65 Secondary         Carriere
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>


## keep respondents whose highest education is "Primary" or "Secondary"
yao %>% filter(highest_education %in% c("Primary", "Secondary"))


## # A tibble: 5 x 10
##     age sex    weight_kg highest_education neighborhood
##   <dbl> <chr>      <dbl> <chr>             <chr>
## 1    45 Female        95 Secondary         Briqueterie
## 2    20 Female        70 Secondary         Briqueterie
## 3    55 Female        67 Primary           Briqueterie
## 4    17 Female        65 Secondary         Briqueterie
## 5    13 Female        65 Secondary         Briqueterie
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>
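

To see what %in% does on its own, outside of filter(), here is a small sketch on a made-up vector (the `education` values are hypothetical). Like ==, it returns one TRUE or FALSE per element.

```r
# hypothetical values, invented for illustration
education <- c("Primary", "University", "Secondary", "Doctorate")

# TRUE where the element is one of the allowed values
keep <- education %in% c("Primary", "Secondary")
keep

# using the result to pick out the matching elements
education[keep]
```

Inside filter(), the same TRUE/FALSE values decide which rows are kept.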


PRACTICE (in RMD)

From yao, keep only respondents who were children (under 18).

With %in%, keep only respondents who live in the “Tsinga” or “Messa”
neighborhoods.


Combining conditions with & and |


We can pass multiple conditions to a single filter() statement separated by commas: 


## keep respondents who are pregnant and are ex-smokers
yao %>% filter(is_pregnant == "Yes", is_smoker == "Ex-smoker") ## only one row


## # A tibble: 1 x 10
##     age sex    weight_kg highest_education neighborhood
##   <dbl> <chr>      <dbl> <chr>             <chr>
## 1    25 Female        90 Secondary         Carriere
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>


When multiple conditions are separated by a comma, they are implicitly combined with
an and (&).


It is best to replace the comma with & to make this more explicit.


## same result as before, but `&` is more explicit
yao %>% filter(is_pregnant == "Yes" & is_smoker == "Ex-smoker")


## # A tibble: 1 x 10
##     age sex    weight_kg highest_education neighborhood
##   <dbl> <chr>      <dbl> <chr>             <chr>
## 1    25 Female        90 Secondary         Carriere
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>
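

If you would like to convince yourself that the comma and & give the same rows, the sketch below compares the two on a made-up data frame (`toy` and its values are hypothetical, not from the survey).

```r
library(dplyr)

# hypothetical toy data, invented for illustration
toy <- data.frame(id = 1:4,
                  is_pregnant = c("Yes", "No", "Yes", "No"),
                  is_smoker = c("Ex-smoker", "Ex-smoker", "Non-smoker", "Smoker"))

with_comma <- toy %>% filter(is_pregnant == "Yes", is_smoker == "Ex-smoker")
with_and   <- toy %>% filter(is_pregnant == "Yes" & is_smoker == "Ex-smoker")

# both versions keep the same single row (id 1)
identical(with_comma, with_and)
with_and
```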


SIDE NOTE

Don't confuse:

• the “,” used to list several conditions in filter(), as in filter(A, B), i.e. filter
based on condition A and (&) condition B

• the “,” in c(A, B), which simply separates the elements being combined
(and has nothing to do with the & operator)


If we want to combine conditions with an or, we use the vertical bar symbol, |. 


## respondents who are pregnant OR who are ex-smokers
yao %>% filter(is_pregnant == "Yes" | is_smoker == "Ex-smoker")


## # A tibble: 5 x 10
##     age sex   weight_kg highest_education neighborhood
##   <dbl> <chr>     <dbl> <chr>             <chr>
## 1    55 Male         96 University        Briqueterie
## 2    42 Male         71 Secondary         Briqueterie
## 3    38 Male         71 University        Briqueterie
## 4    69 Male        108 University        Briqueterie
## 5    65 Male         93 Secondary         Briqueterie
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>


PRACTICE (in RMD)

Filter yao to only keep men who tested IgG positive.

Filter yao to keep both children (under 18) and anyone whose highest
education is primary school.


Negating conditions with ! 


To negate conditions, we wrap them in !().


Below, we drop respondents who are children (less than 18 years) or who weigh less than
30kg:


## drop respondents < 18 years OR < 30 kg
yao %>% filter(!(age < 18 | weight_kg < 30))


## # A tibble: 5 x 10
##     age sex    weight_kg highest_education neighborhood
##   <dbl> <chr>      <dbl> <chr>             <chr>
## 1    45 Female        95 Secondary         Briqueterie
## 2    55 Male          96 University        Briqueterie
## 3    23 Male          74 University        Briqueterie
## 4    20 Female        70 Secondary         Briqueterie
## 5    55 Female        67 Primary           Briqueterie
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>
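

A negated "or" can always be rewritten as an "and" of the opposite conditions. The sketch below checks this on a made-up data frame (`toy` and its values are hypothetical): dropping rows that are under 18 or under 30 kg gives the same result as keeping rows that are at least 18 and at least 30 kg.

```r
library(dplyr)

# hypothetical toy data, invented for illustration
toy <- data.frame(id = 1:4,
                  age = c(10, 25, 40, 16),
                  weight_kg = c(28, 70, 29, 55))

# drop rows that are under 18 OR under 30 kg
dropped <- toy %>% filter(!(age < 18 | weight_kg < 30))

# equivalent: keep rows that are at least 18 AND at least 30 kg
kept <- toy %>% filter(age >= 18 & weight_kg >= 30)

identical(dropped, kept)
dropped
```

Which form to use is a matter of readability; the negated version stays closer to the wording "drop respondents who...".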


The ! operator is also used to negate %in% since R does not have an operator for NOT in.


## drop respondents whose highest education is "Primary" or "Secondary"
yao %>% filter(!(highest_education %in% c("Primary", "Secondary")))


## # A tibble: 5 x 10
##     age sex   weight_kg highest_education neighborhood
##   <dbl> <chr>     <dbl> <chr>             <chr>
## 1    55 Male         96 University        Briqueterie
## 2    23 Male         74 University        Briqueterie
## 3    28 Male         62 Doctorate         Briqueterie
## 4    38 Male         71 University        Briqueterie
## 5    54 Male         71 University        Briqueterie
## # … with 5 more variables: occupation <chr>,
## #   is_smoker <chr>, is_pregnant <chr>, igg_result <chr>,
## #   igm_result <chr>


KEY POINT

It is easier to read filter() statements as keep statements, to avoid
confusion over whether we are filtering in or filtering out!
So the code below would read: “keep respondents who are under 18 or
who weigh less than 30kg”.

yao %>% filter(age < 18 | weight_kg < 30)

And when we wrap conditions in !(), we can then read filter()
statements as drop statements.
So the code below would read: “drop respondents who are under 18 or
who weigh less than 30kg”.

yao %>% filter(!(age < 18 | weight_kg < 30))


PRACTICE (in RMD)

From yao, drop respondents who live in the Tsinga or Messa
neighborhoods.


NA values


The relational operators introduced so far do not work with NA.


Let's make a data subset to illustrate this.


yao_mini <- yao %>%
  select(sex, is_pregnant) %>%
  slice(1, 11, 50, 2) ## custom row order


yao_mini


## # A tibble: 4 x 2
##   sex    is_pregnant
##   <chr>  <chr>
## 1 Female No
## 2 Female No response
## 3 Female Yes
## 4 Male   <NA>


In yao_mini, the last respondent has an NA for the is_pregnant column, because he is
male.


Trying to select this row using == NA will not work.


yao_mini %>% filter(is_pregnant == NA) ## does not work


## # A tibble: 0 x 2
## # … with 2 variables: sex <chr>, is_pregnant <chr>


yao_mini %>% filter(is_pregnant == "NA") ## does not work


## # A tibble: 0 x 2
## # … with 2 variables: sex <chr>, is_pregnant <chr>


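This is because NA stands for a missing (unknown) value. So R cannot evaluate whether it is “equal to” or
“not equal to” anything: such a comparison returns NA rather than TRUE or FALSE, and filter() only
keeps rows where the condition is TRUE. (The second attempt fails for a different reason: "NA" in quotes is an
ordinary string, not a missing value.)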


The special function is.na() is therefore necessary:


## keep rows where `is_pregnant` is NA
yao_mini %>% filter(is.na(is_pregnant))


## # A tibble: 1 x 2
##   sex   is_pregnant
##   <chr> <chr>
## 1 Male  <NA>


This function can be negated with !:


## drop rows where `is_pregnant` is NA
yao_mini %>% filter(!is.na(is_pregnant))


## # A tibble: 3 x 2
##   sex    is_pregnant
##   <chr>  <chr>
## 1 Female No
## 2 Female No response
## 3 Female Yes
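

One consequence of this behavior is easy to miss: any relational test on a column silently drops the rows where that column is NA. The sketch below illustrates this on a made-up data frame (`toy` and its values are hypothetical, loosely modeled on yao_mini).

```r
library(dplyr)

# comparisons involving NA return NA, not TRUE or FALSE
NA == NA
NA != "Yes"

# hypothetical toy data, invented for illustration
toy <- data.frame(id = 1:4,
                  is_pregnant = c("No", "No response", "Yes", NA))

# row 4 is dropped too: its test result is NA rather than TRUE
not_yes <- toy %>% filter(is_pregnant != "Yes")
not_yes
```

So although row 4 is certainly not a "Yes", it does not appear in the result. Keep this in mind whenever a column you filter on contains missing values.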


SIDE NOTE

For tibbles, RStudio will highlight NA values bright red to distinguish them
from other values:


## # A tibble: 5 x 3
##     age sex    is_pregnant
##   <dbl> <chr>  <chr>
## 1    32 Male   NA
## 2    23 Female Yes
## 3    35 Male   NA
## 4    31 Female No
## 5    17 Female No response


A common error with NA


SIDE NOTE

is.na() identifies true NA values, but other encodings such as "NA" or
"NaN" stored as strings will be imperceptible to the
function (they are strings, like any others).
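

The sketch below shows the difference on a made-up vector (the `status` values are hypothetical). It also uses dplyr's na_if() function, which replaces a chosen value with a real NA; this is one possible way to clean such data, not something the lesson itself relies on.

```r
library(dplyr)

# hypothetical values: one string "NA" and one real missing value
status <- c("Smoker", "NA", NA, "Non-smoker")

# only the real NA (third element) is detected
is.na(status)

# na_if() turns the string "NA" into a real NA
cleaned <- na_if(status, "NA")
is.na(cleaned)
```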


PRACTICE (in RMD)

From the yao dataset, keep all the respondents who had missing records
for the report of their smoking status.


PRACTICE (in RMD)

For some respondents the respiration rate, in breaths per minute, was
recorded in the respiration_frequency column.

From yaounde, drop those with a respiration frequency under 20. Think
about NAs while doing this! You should avoid also dropping the NA values.


Wrap Up!


Now you know the two essential verbs to select() columns and to filter() rows. This
way you keep the variables you are interested in by selecting your columns and you keep
the data entries you judge relevant by filtering your rows.


But what about modifying, transforming your data? We will learn about this in the next
lesson. See you there!


Fig: Basic Data Wrangling: select() and filter().


Contributors 


The following team members contributed to this lesson: 


LAURE VANCAUWENBERGHE


Data analyst, the GRAPH Network 
A firm believer in science for good, striving to ally programming, health 
and education 


ANDREE VALLE CAMPOS


R Developer and Instructor, the GRAPH Network 


Motivated by reproducible science and education 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


References 


Some material in this lesson was adapted from the following sources: 


• Horst, A. (2021). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original
work published 2020)


• Subset rows using column values - Filter. (n.d.). Retrieved 12 January 2022, from
https://dplyr.tidyverse.org/reference/filter.html


Artwork was adapted from:


• Horst, A. (2021). R & stats illustrations by Allison Horst.
https://github.com/allisonhorst/stats-illustrations (Original work published 2018)


Lesson notes | Mutating columns 




Intro
Learning objectives
Packages
Datasets
Introducing mutate()
Creating a Boolean variable
Creating a numeric variable based on a formula
Changing a variable’s type
    Integer: as.integer
Wrap up


Intro 


You now know how to keep or drop columns and rows from your dataset. Today you will
learn how to modify existing variables or create new ones, using the mutate() verb from
{dplyr}. This is an essential step in most data analysis projects.


Let’s go!


Fig: the mutate() verb.


Learning objectives 


1. You can use the mutate() function from the {dplyr} package to create new
variables or modify existing variables.


2. You can create new numeric, character, factor, and boolean variables.


Packages 


This lesson will require the packages loaded below: 


if(!require(pacman)) install.packages("pacman")
pacman::p_load(here,
               janitor,
               tidyverse)


Datasets 


In this lesson, we will again use the data from the COVID-19 serological survey conducted
in Yaounde, Cameroon. Below, we import the dataset yaounde and create a smaller
subset called yao. Note that this dataset is slightly different from the one used in the
previous lesson.


yaounde <- read_csv(here::here('data/yaounde_data.csv'))


## a smaller subset of variables
yao <- yaounde %>% select(date_surveyed,
                          age,
                          weight_kg, height_cm,
                          symptoms, is_smoker)


yao


date_surv…   age  weight_kg  height_cm  symptoms        is_smoker
2020-10-22    45         95        169  Muscle pain     Non-smoker
2020-10-24    55         96        185  No sympto...    Ex-smoker
2020-10-24    23         74        180  No sympto...    Smoker
2020-10-22    20         70        164  Rhinitis--Sn... Non-smoker
2020-10-22    55         67        147  No sympto...    Non-smoker
2020-10-25    17         65        162  Fever--Cou...   Non-smoker
2020-10-25    13         65        150  Sneezing        Non-smoker
2020-10-24    28         62        173  Headache        Non-smoker
2020-10-24    30         73        170  Fever--Rhin...  Non-smoker
2020-10-24    13         56        153  No sympto...    Non-smoker
1-10 of 971 rows


We will also use a dataset from a cross-sectional study that aimed to determine the
prevalence of sarcopenia in the elderly population (>60 years) in Karnataka, India.
Sarcopenia is a condition that is common in elderly people and is characterized by
progressive and generalized loss of skeletal muscle mass and strength. The data was
obtained from Zenodo, and the study is described in a source publication.


Below, we import and view this dataset:

sarcopenia <- read_csv(here::here('data/sarcopenia_elderly.csv'))


sarcopenia


number   age  age_group  sex_male…  marital_s…  height_m…  wei…
     7  60.8  Sixties    …          married     1.57       54.5
     8  72.3  Seventies  …          married     1.65       …
     9  62.6  Sixties    …          married     1.59       …
    12  72    Seventies  …          widow       1.473      …
    13  60.1  Sixties    …          married     1.55       …
    19  60.6  Sixties    …          married     1.422      …
    45  60.1  Sixties    …          widower     1.68       …
    46  60.2  Sixties    …          married     1.8        …
    51  63    Sixties    …          married     1.6        …
    56  60.4  Sixties    …          married     1.6        …
1-10 of 239 rows


Introducing mutate()


The mutate() function. (Drawing adapted from Allison Horst)


We use dplyr::mutate() to create new variables or modify existing variables. The
syntax is quite intuitive, and generally looks like
df %>% mutate(new_column_name = what_it_contains).

Let’s see a quick example.


The yaounde dataset currently contains a column called height_cm, which shows the
height, in centimeters, of survey respondents. Let’s create a data frame, yao_height,
with just this column, for easy illustration:


yao_height <- yaounde %>% select(height_cm)
yao_height


## # A tibble: 5 x 1
##   height_cm
##       <dbl>
## 1       169
## 2       185
## 3       180
## 4       164
## 5       147


What if you wanted to create a new variable, called height_meters, where heights are
converted to meters? You can use mutate() for this, with the argument
height_meters = height_cm/100:


yao_height %>%
  mutate(height_meters = height_cm/100)


## # A tibble: 5 x 2
##   height_cm height_meters
##       <dbl>         <dbl>
## 1       169          1.69
## 2       185          1.85
## 3       180          1.8
## 4       164          1.64
## 5       147          1.47


Great. The syntax is beautifully simple, isn't it? 


SIDE NOTE

Sometimes it is helpful to think of data manipulation functions in the
context of familiar spreadsheet software. Here is what the R command
mutate(height_m = height_cm/100) would be equivalent to in Google
Sheets:

[Figure: a Google Sheets spreadsheet with the height_cm values (169, 185, 180, 164, 147, 162, 150) in column A and a second column, B, next to it.]


Now, imagine there was a small error in the equipment used to measure respondent
heights, and all heights are 5cm too small. You would therefore like to add 5cm to all heights in
the dataset. To do this, rather than creating a new variable as you did before, you can
modify the existing variable with mutate():


yao_height %>%
  mutate(height_cm = height_cm + 5)


## # A tibble: 5 x 1
##   height_cm
##       <dbl>
## 1       174
## 2       190
## 3       185
## 4       169
## 5       152


Again, very easy to do! 
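

One detail worth knowing: mutate() returns a new data frame and leaves the original untouched, so a change is only kept if you assign the result to an object. A minimal sketch on made-up data (`heights` and `heights_fixed` are hypothetical names):

```r
library(dplyr)

# hypothetical toy data, invented for illustration
heights <- data.frame(height_cm = c(169, 185, 180))

# without assignment, the modified data frame is only printed
heights %>% mutate(height_cm = height_cm + 5)
heights$height_cm        # still the original values

# assign the result to keep the change
heights_fixed <- heights %>% mutate(height_cm = height_cm + 5)
heights_fixed$height_cm
```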


PRACTICE (in RMD)

The sarcopenia data frame has a variable weight_kg, which contains
respondents’ weights in kilograms. Create a new column, called
weight_grams, with respondents’ weights in grams. Store your answer
in the Q_weight_to_g object. (1 kg equals 1000 grams.)

# Complete the code with your answer:
Q_weight_to_g <-
  sarcopenia %>%


Hopefully you now see that the mutate() function is quite user-friendly. In theory, we could
end the lesson here, because you now know how to use mutate(). But of course, the
devil will be in the details: the interesting thing is not mutate() itself but what goes
inside the mutate() call.


The rest of the lesson will go through a few use cases for the mutate() verb. In the
process, we'll touch on several new functions you have not yet encountered.


Creating a Boolean variable 


You can use mutate() to create a Boolean variable to categorize part of your population.


Below we create a Boolean variable, is_child, which is either TRUE if the subject is a child
or FALSE if the subject is an adult (first, we select just the age variable so it’s easy to see
what is being done; you will likely not need this pre-selection for your own analyses).


yao %>%
  select(age) %>%
  mutate(is_child = age <= 18)


## # A tibble: 5 x 2
##     age is_child
##   <dbl> <lgl>
## 1    45 FALSE
## 2    55 FALSE
## 3    23 FALSE
## 4    20 FALSE
## 5    55 FALSE


The code age <= 18 evaluates whether each age is less than or equal to 18. Ages that
match that condition (ages 18 and under) are TRUE and those that fail the condition are
FALSE.


Such a variable is useful to, for example, count the number of children in the dataset. The
code below does this with the janitor::tabyl() function:


yao %>%
  mutate(is_child = age <= 18) %>%
  tabyl(is_child)


##  is_child   n   percent
##     FALSE 662 0.6817714
##      TRUE 309 0.3182286


You can observe that 31.8% (0.318...) of respondents in the dataset are children. 
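

The count and the proportion can also be obtained directly from a Boolean vector, because R treats TRUE as 1 and FALSE as 0 in arithmetic. Here is a small sketch on made-up ages (the `ages` values are hypothetical):

```r
# hypothetical ages, invented for illustration
ages <- c(45, 17, 23, 12, 55)

is_child <- ages <= 18

sum(is_child)    # number of TRUE values (children)
mean(is_child)   # proportion of TRUE values
```

With two children out of five, sum() gives 2 and mean() gives 0.4, the same kind of information that tabyl() reports in its n and percent columns.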


Let's see one more example, since the concept of Boolean variables can be a bit 
confusing. The symptoms variable reports any respiratory symptoms experienced by the 
patient: 


yao %>%
  select(symptoms)


## # A tibble: 5 x 1
##   symptoms
##   <chr>
## 1 Muscle pain
## 2 No symptoms
## 3 No symptoms
## 4 Rhinitis--Sneezing--Anosmia or ageusia
## 5 No symptoms


You could create a Boolean variable, called has_no_symptoms, that is set to TRUE if the
respondent reported no symptoms:


yao %>%
  select(symptoms) %>%
  mutate(has_no_symptoms = symptoms == "No symptoms")


## # A tibble: 5 x 2
##   symptoms                               has_no_symptoms
##   <chr>                                  <lgl>
## 1 Muscle pain                            FALSE
## 2 No symptoms                            TRUE
## 3 No symptoms                            TRUE
## 4 Rhinitis--Sneezing--Anosmia or ageusia FALSE
## 5 No symptoms                            TRUE


Similarly, you could create a Boolean variable called has_any_symptoms that is set to
TRUE if the respondent reported any symptoms. For this, you'd simply swap the
symptoms == "No symptoms" code for symptoms != "No symptoms":


yao %>%
  select(symptoms) %>%
  mutate(has_any_symptoms = symptoms != "No symptoms")


## # A tibble: 5 x 2
##   symptoms                               has_any_symptoms
##   <chr>                                  <lgl>
## 1 Muscle pain                            TRUE
## 2 No symptoms                            FALSE
## 3 No symptoms                            FALSE
## 4 Rhinitis--Sneezing--Anosmia or ageusia TRUE
## 5 No symptoms                            FALSE


Still confused by the Boolean examples? That’s normal. Pause and play with the code
above a little. Then try the practice question below.


PRACTICE (in RMD)

Women with a grip strength below 20kg are considered to have low grip
strength. With a female subset of the sarcopenia data frame, add a
variable called low_grip_strength that is TRUE for women with a grip
strength < 20 kg and FALSE for other women.


# Complete the code with your answer:
Q_women_low_grip_strength <-
  sarcopenia %>%
  filter(sex_male_1_female_0 == 0) %>% # first we filter the dataset to only women
  # mutate code here


What percentage of women surveyed have a low grip strength according
to the definition above? Enter your answer as a number without quotes
(e.g. 43.3 or 12.2), to one decimal place.


Q_prop_women_low_grip_strength <- YOUR_ANSWER_HERE


Creating a numeric variable based on a formula 


Now, let’s look at an example of creating a numeric variable, the body mass index (BMI),
which is a commonly used health indicator. The formula for the body mass index can be
written as:


$$\text{BMI} = \frac{\text{weight (kilograms)}}{\text{height (meters)}^2}$$


You can use mutate() to calculate BMI in the yao dataset as follows:


yao %>%
  select(weight_kg, height_cm) %>%
  # first obtain the height in meters
  mutate(height_meters = height_cm/100) %>%
  # then use the BMI formula
  mutate(bmi = weight_kg / (height_meters)^2)


## # A tibble: 5 x 4
##   weight_kg height_cm height_meters   bmi
##       <dbl>     <dbl>         <dbl> <dbl>
## 1        95       169          1.69  33.3
## 2        96       185          1.85  28.0
## 3        74       180          1.8   22.8
## 4        70       164          1.64  26.0
## 5        67       147          1.47  31.0


Let’s save the data frame with BMIs for later. We will use it in the next section.


yao_bmi <-
  yao %>%
  select(weight_kg, height_cm) %>%
  # first obtain the height in meters
  mutate(height_meters = height_cm/100) %>%
  # then use the BMI formula
  mutate(bmi = weight_kg / (height_meters)^2)
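

As a quick check of the arithmetic, here is the BMI calculation done by hand for a weight of 95 kg and a height of 169 cm. Note that the height is squared with the ^ operator, which R evaluates before the division. The variable names below are just for this sketch.

```r
weight_kg <- 95
height_m  <- 169 / 100

# ^ is evaluated before /, so this is 95 / (1.69 * 1.69)
bmi <- weight_kg / height_m^2

# the same calculation, with the squaring written out
bmi_explicit <- weight_kg / (height_m * height_m)

round(bmi, 1)
```

Rounded to one decimal place this gives 33.3.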


PRACTICE (in RMD)

Appendicular muscle mass (ASM), a useful health indicator, is the sum of
muscle mass in all 4 limbs. It can be predicted with the following formula,
called Lee's equation:


$$\text{ASM (kg)} = (0.244 \times \text{weight (kg)}) + (7.8 \times \text{height (m)}) + (6.6 \times \text{sex}) - (0.098 \times \text{age}) - 4.5$$


The sex variable in the formula assumes that men are coded as 1 and
women are coded as 0 (which is already the case for our sarcopenia
dataset). The - 4.5 at the end is a constant used for Asians.


Calculate the ASM value for all individuals in the sarcopenia dataset.
This value should be in a new column called asm.


# Complete the code with your answer:
Q_asm_calculation <-
  sarcopenia #
  #


Changing a variable’s type 


In your data analysis workflow, you often need to redefine variable types. You can do so
with functions like as.integer(), as.factor(), as.character() and as.Date()
within your mutate() call. Let’s see one example of this.


Integer: as.integer


as.integer() converts any numeric values to integers:


yao_bmi %>%
  mutate(bmi_integer = as.integer(bmi))


## # A tibble: 5 x 5
##   weight_kg height_cm height_meters   bmi bmi_integer
##       <dbl>     <dbl>         <dbl> <dbl>       <int>
## 1        95       169          1.69  33.3          33
## 2        96       185          1.85  28.0          28
## 3        74       180          1.8   22.8          22
## 4        70       164          1.64  26.0          26
## 5        67       147          1.47  31.0          31


Note that as.integer() truncates numbers (it drops the decimal part) rather than rounding them up or down, as you might
expect. For example the BMI 22.8 in the third row is truncated to 22. If you want rounded
numbers, you can use the round() function from base R:


yao_bmi %>%
  mutate(bmi_integer = as.integer(bmi),
         bmi_rounded = round(bmi))


## # A tibble: 5 x 6
##   weight_kg height_cm height_meters   bmi bmi_integer
##       <dbl>     <dbl>         <dbl> <dbl>       <int>
## 1        95       169          1.69  33.3          33
## 2        96       185          1.85  28.0          28
## 3        74       180          1.8   22.8          22
## 4        70       164          1.64  26.0          26
## 5        67       147          1.47  31.0          31
## # … with 1 more variable: bmi_rounded <dbl>


PRO TIP

Using as.integer() on a factor variable is a fast way of encoding
strings into numbers. It can be essential to do so for some machine
learning data processing.


SIDE NOTE

The base R round() function rounds halves to the nearest even number. That is, the number 2.5,
for example, is rounded down to 2 by round(), while 3.5 is rounded up to 4. This is weird. Most people
expect 2.5 to be rounded up to 3, not down to 2. So most of the time,
you'll actually want to use the round_half_up() function from janitor.
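

The sketch below shows how base R behaves in current versions of R: round() sends exact halves to the nearest even number, while as.integer() simply drops the decimal part. The helper round_half_up_sketch() is a hand-written illustration for non-negative numbers only; it is not the janitor function.

```r
round(2.5)        # 2: the half goes to the nearest even number
round(3.5)        # 4
as.integer(2.9)   # 2: truncation, no rounding at all

# illustrative helper (an assumption for this sketch, non-negative input only)
round_half_up_sketch <- function(x) floor(x + 0.5)

round_half_up_sketch(c(2.5, 3.5))   # 3 4
```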




CHALLENGE

In future lessons, you will discover how to manipulate dates and how to
convert to a date type using as.Date().


PRACTICE (in RMD)

Use as.integer() to convert the ages of respondents in the
sarcopenia dataset to integers (truncating them in the process). This
should go in a new column called age_integer.

# Complete the code with your answer:
Q_age_integer <-
  sarcopenia #
  #


Wrap up


As you can imagine, transforming data is an essential step in any data analysis workflow.
It is often required to clean data and to prepare it for further statistical analysis or for
making plots. And as you have seen, it is quite simple to transform data with dplyr’s
mutate() function, although certain transformations are trickier to achieve than others.


Congrats on making it through.

But your data wrangling journey isn’t over yet! In our next lessons, we will learn how to
create complex data summaries and how to create and work with data frame groups.
Intrigued? See you in the next lesson.


Fig: Basic Data Wrangling with select(), filter(), and mutate().


Contributors 


The following team members contributed to this lesson: 


LAURE VANCAUWENBERGHE 


Data analyst, the GRAPH Network 
A firm believer in science for good, striving to ally programming, health 
and education 


ANDREE VALLE CAMPOS


R Developer and Instructor, the GRAPH Network 
Motivated by reproducible science and education 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


References 


Some material in this lesson was adapted from the following sources: 


• Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original
work published 2020)


• Create, modify, and delete columns - Mutate. (n.d.). Retrieved 21 February 2022,
from https://dplyr.tidyverse.org/reference/mutate.html


• Apply a function (or functions) across multiple columns - Across. (n.d.). Retrieved
21 February 2022, from https://dplyr.tidyverse.org/reference/across.html


Artwork was adapted from:


• Horst, A. (2022). R & stats illustrations by Allison Horst.
https://github.com/allisonhorst/stats-illustrations (Original work published 2018)


Other references:


• Lee, Robert C, ZiMian Wang, Moonseong Heo, Robert Ross, Ian Janssen, and Steven B
Heymsfield. “Total-Body Skeletal Muscle Mass: Development and Cross-Validation of
Anthropometric Prediction Models.” The American Journal of Clinical Nutrition 72,
no. 3 (2000): 796-803. https://doi.org/10.1093/ajcn/72.3.796.


Lesson notes | Conditional mutating 




Introduction
Learning objectives
Packages
Datasets
Reminder: relational operators (comparators) in R
Introduction to case_when()
The TRUE default argument
Matching NAs with is.na()
Keeping default values of a variable
Multiple conditions on a single variable
Multiple conditions on multiple variables
Order of priority of conditions in case_when()
Overlapping conditions within case_when()
Binary conditions: dplyr::if_else()
Wrap up


Introduction 


In the last lesson, you learned the basics of data transformation using the {dplyr} function
mutate().


In that lesson, we mostly looked at global transformations; that is, transformations that
did the same thing to an entire variable. In this lesson, we will look at how to conditionally
manipulate certain rows based on whether or not they meet defined criteria.


For this, we will mostly use the case_when() function, which you will likely come to see as
one of the most important functions in {dplyr} for data wrangling tasks.


Let's get started.


Fig: the case_when() conditions


Learning objectives 


1. You can transform or create new variables based on conditions using
dplyr::case_when().


2. You know how to use the TRUE condition in case_when() to match unmatched
cases.


3. You can handle NA values in case_when() transformations.


4. You understand how to keep the default values of a variable in a case_when()
formula.


5. You can write case_when() conditions involving multiple comparators and multiple
variables.


6. You understand the priority order of case_when() conditions.


7. You can use dplyr::if_else() for binary conditional assignment.


Packages


This lesson will require the tidyverse suite of packages:


if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse)


Datasets 


In this lesson, we will again use data from the COVID-19 serological survey conducted in 
Yaounde, Cameroon. 


# Import and view the dataset
yaounde <-
  read_csv(here::here('data/yaounde_data.csv')) %>%
  ## make every 5th age missing
  mutate(age = case_when(row_number() %in% seq(5, 900, by = 5) ~ NA_real_,
                         TRUE ~ age)) %>%
  ## rename the age variable
  rename(age_years = age) %>%
  # drop the age category column
  select(-age_category)


yaounde


Note that in the code chunk above, we slightly modified the age column, artificially 
introducing some missing values, and we also dropped the age_category column. This is 
to help illustrate some key points in the tutorial. 


For practice questions, we will also use an outbreak linelist of 136 cases of influenza A 
H7N9 from a 2013 outbreak in China. This is a modified version of a dataset compiled by 
Kucharski et al. (2014). 


# Import and view the dataset
flu_linelist <- read_csv(here::here('data/flu_h7n9_china_2013.csv'))


flu_linelist


Reminder: relational operators (comparators) in R 


Throughout this lesson, you will use a lot of relational operators in R. Recall that relational 
operators, sometimes called “comparators”, test the relation between two values, and 
return TRUE, FALSE or NA. 


A list of the most common operators is given below: 


Operator    is TRUE if

A < B       A is less than B
A <= B      A is less than or equal to B
A > B       A is greater than B
A >= B      A is greater than or equal to B
A == B      A is equal to B
A != B      A is not equal to B
A %in% B    A is an element of B
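
As a quick refresher, here is a minimal sketch of these comparators applied to a small vector. The `ages` vector is made up for illustration and is not part of the lesson datasets. Notice how a missing value usually produces NA rather than TRUE or FALSE:

```r
# Hypothetical ages, including one missing value
ages <- c(12, 18, 45, NA)

ages < 18            # TRUE FALSE FALSE NA
ages >= 18           # FALSE TRUE TRUE NA
ages != 18           # TRUE FALSE TRUE NA
ages %in% c(18, 45)  # FALSE TRUE TRUE FALSE (%in% does not return NA)
```

This NA behaviour matters for case_when(), because a condition that evaluates to NA does not count as a match.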




Introduction to case_when()


To get familiar with case_when(), let's begin with a simple conditional transformation
on the age_years column of the yaounde dataset. First we subset the data frame to just
the age_years column for easy illustration:


yaounde_age <-
  yaounde %>%
  select(age_years)


yaounde_age

Now, using case_when(), we can make a new column, called “age_group”, that has the
value “Child” if the person is below 18, and “Adult” if the person is 18 and up:

yaounde_age %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               age_years >= 18 ~ "Adult"))


The case_when() syntax may seem a bit foreign, but it is quite simple: on the left-hand
side (LHS) of the ~ sign (called a “tilde”), you provide the condition(s) you want to
evaluate, and on the right-hand side (RHS), you provide a value to put in if the condition is
true.


So the statement case_when(age_years < 18 ~ "Child", age_years >= 18 ~
"Adult") can be read as: “if age_years is below 18, input ‘Child’, else if age_years is
greater than or equal to 18, input ‘Adult’”.
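
To see this on a few rows whose results you can check by eye, here is a minimal sketch. The `toy_age` data frame is hypothetical and exists only for this illustration:

```r
library(dplyr)

# Hypothetical toy data: three ages and one missing value
toy_age <- tibble(age_years = c(5, 18, 40, NA))

toy_age_grouped <-
  toy_age %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               age_years >= 18 ~ "Adult"))

toy_age_grouped$age_group
# "Child" "Adult" "Adult" NA
```

The last row matches neither condition (both comparisons return NA), so case_when() leaves it as NA.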


VOCAB

Formulas, LHS and RHS


Each line of a case_when() call is termed a “formula” or, sometimes, a
“two-sided formula”. Each formula has a left-hand side (abbreviated
LHS) and a right-hand side (abbreviated RHS).


For example, the code age_years < 18 ~ "Child" is a “formula”; its
LHS is age_years < 18 while its RHS is "Child".


You are likely to come across these terms when reading the
documentation for the case_when() function, and we will also refer to
them in this lesson.


After creating a new variable with case_when(), it is a good idea to inspect it thoroughly
to make sure it worked as intended. 


To inspect the variable, you can pipe your data frame into the View() function to view it 
in spreadsheet form: 


yaounde_age %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               age_years >= 18 ~ "Adult")) %>%
  View()


This would open up a new tab in RStudio where you should manually scan through the
new column, age_group, and the referenced column, age_years, to make sure your
case_when() statement did what you wanted it to do.


You could also pass the new column into the tabyl() function (from the {janitor}
package) to ensure that the proportions “make sense”:


yaounde_age %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               age_years >= 18 ~ "Adult")) %>%
  tabyl(age_group)


PRACTICE (in RMD)

With the flu_linelist data, make a new column, called age_group,
that has the value “Below 50” for people under 50 and “50 and above” for
people aged 50 and up. Use the case_when() function.


# Complete the code with your answer:
Q_age_group <-
  flu_linelist %>%
  mutate(age_group = )


Out of the entire sample of individuals in the flu_linelist dataset,
what percentage are confirmed to be below 60? (Repeat the above
procedure but with the 60 cutoff, then call tabyl() on the age_group
variable. Use the percent column, not the valid_percent column.)


# Enter your answer as a number without quotes:
Q_age_group_percentage <- YOUR_ANSWER_HERE


The TRUE default argument 


In a case_when() statement, you can use a literal TRUE condition to match any rows not
yet matched by the conditions provided before it.


For example, if we keep only the first condition from the previous example,
age_years < 18, and define the default value to be TRUE ~ "Not child", then all
adults and NA values in the data set will be labeled "Not child" by default.


yaounde_age %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               TRUE ~ "Not child"))


This TRUE condition can be read as “for everything else...”. 


So the full case_when() statement used above, age_years < 18 ~ "Child", TRUE ~
"Not child", would then be read as: “if age is below 18, input ‘Child’ and for everyone
else not yet matched, input ‘Not child’”.
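
Here is a minimal sketch of the TRUE default on made-up data (the `toy_age` data frame is hypothetical), so you can confirm that the missing age is also swept into the default category:

```r
library(dplyr)

# Hypothetical toy data: three ages and one missing value
toy_age <- tibble(age_years = c(5, 18, 40, NA))

toy_not_child <-
  toy_age %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               # everything not matched above, NA included
                               TRUE ~ "Not child"))

toy_not_child$age_group
# "Child" "Not child" "Not child" "Not child"
```

If labelling missing ages as "Not child" is not what you want, you need a separate condition for them, placed before the TRUE line.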


WATCH OUT

It is important to use TRUE as the final condition in case_when(). If you
use it as the first condition, it will take precedence over all others, as seen
here:

yaounde_age %>%
  mutate(age_group = case_when(TRUE ~ "Not child",
                               age_years < 18 ~ "Child"))


As you can observe, all individuals are now coded with “Not child”,
because the TRUE condition was placed first, and therefore took
precedence. We will explore the issue of precedence further below.


Matching NA’s with is.na() 


We can match missing values manually with is.na(). Below we match NA ages with 
is.na() and set their age group to “Missing age”: 


yaounde_age %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               age_years >= 18 ~ "Adult",
                               is.na(age_years) ~ "Missing age"))
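
You may wonder why we use is.na() rather than a condition like age_years == NA. The sketch below, on a hypothetical `toy_age` data frame, shows the difference: comparing anything to NA returns NA, so that condition never matches a single row.

```r
library(dplyr)

# Hypothetical toy data: two ages and one missing value
toy_age <- tibble(age_years = c(5, 40, NA))

toy_na <-
  toy_age %>%
  mutate(
    # `== NA` always evaluates to NA, so this condition never matches
    wrong = case_when(age_years == NA ~ "Missing age",
                      TRUE ~ "Age recorded"),
    # is.na() returns TRUE for missing values, so this one works
    right = case_when(is.na(age_years) ~ "Missing age",
                      TRUE ~ "Age recorded")
  )

toy_na$wrong  # "Age recorded" "Age recorded" "Age recorded"
toy_na$right  # "Age recorded" "Age recorded" "Missing age"
```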


PRACTICE (in RMD)

As before, using the flu_linelist data, make a new column, called
age_group, that has the value “Below 60” for people under 60 and “60
and above” for people aged 60 and up. But this time, also set those with
missing ages to “Missing age”.


# Complete the code with your answer:
Q_age_group_nas <-
  flu_linelist %>%


PRACTICE (in RMD)

The gender column of the flu_linelist dataset contains the values “f”,
“m” and NA:


flu_linelist %>%
  tabyl(gender)

Recode “f”, “m” and NA to “Female”, “Male” and “Missing gender”
respectively. You should modify the existing gender column, not create a
new column.


# Complete the code with your answer:
Q_gender_recode <-
  flu_linelist %>%




Keeping default values of a variable 


The right-hand side (RHS) of a case_when() formula can also take in a variable from your
data frame. This is often useful when you want to change just a few values in a column.


Let’s see an example with the highest_education column, which contains the highest
education level attained by a respondent:


yaounde_educ <-
  yaounde %>%
  select(highest_education)

yaounde_educ


Below, we create a new column, highest_educ_recode, where we recode both
“University” and “Doctorate” to the value “Post-secondary”:


yaounde_educ %>%
  mutate(highest_educ_recode =
           case_when(
             highest_education %in% c("University", "Doctorate") ~ "Post-secondary"
           ))


It worked, but now we have NAs for all other rows. To keep these other rows at their
default values, we can add the line TRUE ~ highest_education (with a variable,
highest_education, on the right-hand side of a formula):


yaounde_educ %>%
  mutate(highest_educ_recode =
           case_when(
             highest_education %in% c("University", "Doctorate") ~ "Post-secondary",
             TRUE ~ highest_education
           ))


Now the case_when() statement reads: ‘If highest_education is “University” or
“Doctorate”, input “Post-secondary”. For everyone else, input the value from
highest_education’.
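
The sketch below repeats this pattern on a tiny made-up data frame (`toy_educ` is hypothetical) so that you can follow every row. Note that the replacement value and the column on the right-hand side of the TRUE line are both character values; case_when() expects all right-hand sides to be of compatible types.

```r
library(dplyr)

# Hypothetical toy data with one missing education level
toy_educ <- tibble(
  highest_education = c("Primary", "University", "Doctorate", NA)
)

toy_recode <-
  toy_educ %>%
  mutate(highest_educ_recode =
           case_when(
             highest_education %in% c("University", "Doctorate") ~ "Post-secondary",
             # all other rows keep their original value (an NA stays NA)
             TRUE ~ highest_education
           ))

toy_recode$highest_educ_recode
# "Primary" "Post-secondary" "Post-secondary" NA
```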


Above we have been putting the recoded values in a separate column,
highest_educ_recode, but for this kind of replacement, it is more common to simply
overwrite the existing column:


yaounde_educ %>%
  mutate(highest_education =
           case_when(
             highest_education %in% c("University", "Doctorate") ~ "Post-secondary",
             TRUE ~ highest_education
           ))


We can read this last case_when() statement as: ‘If highest_education is “University” or
“Doctorate”, change the value to “Post-secondary”. For everyone else, leave in the value
from highest_education’.


PRACTICE (in RMD)

Using the flu_linelist data, modify the existing column outcome by
replacing the value “Recover” with “Recovery”.

# Complete the code with your answer:
Q_recode_recovery <-
  flu_linelist %>%


(We know it’s a lot of code for such a simple change. Later you will see
easier ways to do this.)


PRO TIP

Avoiding long code lines

As you start to write increasingly complex
case_when() statements, it will become helpful to use line breaks to
avoid long lines of code.

To assist with creating line breaks, you can use the {styler} package.
Install it with pacman::p_load(styler). Then to reformat any piece of
code, highlight the code, click the “Addins” button in RStudio, then click
on “Style selection”:

Fig: the RStudio “Addins” menu, with the STYLER options “Set style”, “Style selection”, “Style active file” and “Style active package”.


Alternatively, you could highlight the code and use the shortcut Shift +
Command/Control + A to use RStudio’s built-in code reformatter.


Sometimes {styler} does a better job at reformatting. Sometimes the
built-in reformatter does a better job.


Multiple conditions on a single variable 


LHS conditions in case_when() formulas can have multiple parts. Let’s see an example of
this.


But first, we will draw on what we learned in the mutate() lesson and
recreate the BMI variable. This involves first converting the height_cm variable to meters,
then calculating BMI.


yaounde_BMI <-
  yaounde %>%
  mutate(height_m = height_cm/100,
         BMI = (weight_kg / (height_m)^2)) %>%
  select(BMI)


yaounde_BMI


Recall the following BMI categories:
• If the BMI is less than 18.5, the person is considered underweight.
• A normal BMI is greater than or equal to 18.5 and less than 25.
• An overweight BMI is greater than or equal to 25 and less than 30.
• An obese BMI is greater than or equal to 30.


The condition BMI >= 18.5 & BMI < 25, used to define Normal weight, is a compound
condition because it has two comparators: >= and <.


yaounde_BMI <-
  yaounde_BMI %>%
  mutate(BMI_classification = case_when(
    BMI < 18.5 ~ 'Underweight',
    BMI >= 18.5 & BMI < 25 ~ 'Normal weight',
    BMI >= 25 & BMI < 30 ~ 'Overweight',
    BMI >= 30 ~ 'Obese'))


yaounde_BMI


Let’s use tabyl() to have a look at our data:


yaounde_BMI %>%
  tabyl(BMI_classification)


But you can see that the levels of BMI_classification are listed in alphabetical order,
from Normal weight to Underweight, instead of following the weight scale between
Underweight and Obese. Remember that if you want a specific order, you can make
BMI_classification a factor using mutate() and define its levels.


yaounde_BMI %>%
  mutate(BMI_classification = factor(BMI_classification,
                                     levels = c("Obese",
                                                "Overweight",
                                                "Normal weight",
                                                "Underweight"))) %>%
  tabyl(BMI_classification)


WATCH OUT


With compound conditions, you should remember to input the variable
name every time there is a comparator. R learners often forget this and
will try to run code that looks like this:


yaounde_BMI %>%
  mutate(BMI_classification = case_when(
    BMI < 18.5 ~ 'Underweight',
    BMI >= 18.5 & < 25 ~ 'Normal weight',
    BMI >= 25 & < 30 ~ 'Overweight',
    BMI >= 30 ~ 'Obese'))


The definitions for the “Normal weight” and “Overweight” categories are
mistaken. Do you see the problem? Try to run the code to spot the error.
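
If you want to check your diagnosis without running the whole pipeline, the sketch below asks R to parse each condition as text. The helper function `parses()` is a hypothetical name written just for this illustration; it returns FALSE when R cannot even read the code.

```r
# The condition with the missing variable name, and the corrected one
bad_condition <- "BMI >= 18.5 & < 25"
good_condition <- "BMI >= 18.5 & BMI < 25"

# Hypothetical helper: TRUE if the text is valid R code, FALSE otherwise
parses <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

parses(bad_condition)   # FALSE
parses(good_condition)  # TRUE
```

In other words, the faulty version is a syntax error: after the & sign, R expects a complete comparison such as BMI < 25, not a bare < 25.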


PRACTICE (in RMD)

With the flu_linelist data, make a new column, called adolescent,
that has the value “Yes” for people in the 10-19 (at least 10 and less than
20) age group, and “No” for everyone else.


# Complete the code with your answer:
Q_adolescent_grouping <-
  flu_linelist %>%


Multiple conditions on multiple variables 


In all examples seen so far, you have only used conditions involving a single variable at a 
time. But LHS conditions often refer to multiple variables at once. 


Let’s see a simple example with age and sex in the yaounde data frame. First, we select 
just these two variables for easy illustration: 


yaounde_age_sex <-
  yaounde %>%
  select(age_years, sex)

yaounde_age_sex


Now, imagine we want to recruit women and men in the 20-29 age group into two studies. 
For this we'd like to create a column, called recruit, with the following schema: 


• Women aged 20-29 should have the value “Recruit to female study”
• Men aged 20-29 should have the value “Recruit to male study”
• Everyone else should have the value “Do not recruit”


To do this, we run the following case_when() statement:


yaounde_age_sex %>%
  mutate(recruit = case_when(
    sex == "Female" & age_years >= 20 & age_years <= 29 ~ "Recruit to female study",
    sex == "Male" & age_years >= 20 & age_years <= 29 ~ "Recruit to male study",
    TRUE ~ "Do not recruit"
  ))


You could also add extra pairs of parentheses around the age criteria within each 
condition: 


yaounde_age_sex %>%
  mutate(recruit = case_when(
    sex == "Female" & (age_years >= 20 & age_years <= 29) ~ "Recruit to female study",
    sex == "Male" & (age_years >= 20 & age_years <= 29) ~ "Recruit to male study",
    TRUE ~ "Do not recruit"
  ))


This extra pair of parentheses does not change the code output, but it improves
readability because the reader can visually see that your condition is made of two parts,
one for sex, sex == "Female", and another for age, (age_years >= 20 &
age_years <= 29).
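
As a possible alternative for the age part, {dplyr} also provides between(), which tests whether a value lies within a range, both ends included. The sketch below is one way to write the same schema, using a hypothetical `toy_people` data frame; it is not the approach used in the rest of this lesson.

```r
library(dplyr)

# Hypothetical toy data
toy_people <- tibble(
  sex = c("Female", "Male", "Female", "Male"),
  age_years = c(25, 29, 35, 19)
)

toy_recruit <-
  toy_people %>%
  mutate(recruit = case_when(
    # between(x, 20, 29) is the same as x >= 20 & x <= 29
    sex == "Female" & between(age_years, 20, 29) ~ "Recruit to female study",
    sex == "Male" & between(age_years, 20, 29) ~ "Recruit to male study",
    TRUE ~ "Do not recruit"
  ))

toy_recruit$recruit
# "Recruit to female study" "Recruit to male study" "Do not recruit" "Do not recruit"
```

Because between() includes both bounds, it only fits conditions of the “at least ... and at most ...” kind, not half-open ranges such as “at least 10 and less than 20”.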


PRACTICE (in RMD)

With the flu_linelist data, make a new column, called recruit, with
the following schema:


• Individuals aged 30-59 (at least 30, younger than 60) from the
Jiangsu province should have the value “Recruit to Jiangsu study”
• Individuals aged 30-59 from the Zhejiang province should have the
value “Recruit to Zhejiang study”
• Everyone else should have the value “Do not recruit”


# Complete the code with your answer:
Q_age_province_grouping <-
  flu_linelist %>%
  mutate(recruit = )




Order of priority of conditions in case_when()


Note that the order of conditions is important, because conditions listed at the top of
your case_when() statement take priority over others.


To understand this, run the example below:


yaounde_age_sex %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               age_years < 30 ~ "Young adult",
                               age_years < 120 ~ "Older adult"))


This initially looks like a faulty case_when() statement because the age conditions
overlap. For example, the statement age_years < 120 ~ "Older adult" (which reads
“if age is below 120, input ‘Older adult’”) suggests that anyone between ages 0 and 120
(even a 1-year-old baby!) would be coded as “Older adult”.


But as you saw, the code actually works fine! People under 18 are still coded as “Child”.
What’s going on? Essentially, the case_when() statement is interpreted as a series of
branching logical steps, starting with the first condition. So this particular statement can
be read as: “If age is below 18, input ‘Child’, and otherwise, if age is below 30, input ‘Young
adult’, and otherwise, if age is below 120, input ‘Older adult’”.


This is illustrated in the schematic below: 


Fig: Order of evaluation with dplyr::case_when(). The conditions age < 18 ~ "Child", age < 30 ~ "Young adult" and age < 120 ~ "Older adult" are checked from top to bottom, and each age takes the label of the first condition it matches.


This means that if you swap the order of the conditions, you will end up with a faulty
case_when() statement:


yaounde_age %>%
  mutate(age_group = case_when(age_years < 120 ~ "Older adult",
                               age_years < 30 ~ "Young adult",
                               age_years < 18 ~ "Child"))


As you can see, everyone with a recorded age is coded as “Older adult”. This happens
because the first condition matches everyone, so there is no one left to match with the
subsequent conditions. The statement can be read “If age is below 120, input ‘Older adult’,
and otherwise if age is below 30....” But there is no “otherwise” because everyone has
already been matched!


This is illustrated in the diagram below: 


Fig: A faulty case_when() statement. With age < 120 ~ "Older adult" listed first, every age in the example (17, 19, 27, 30, 70, 75) is coded as "Older adult". Everyone’s age is below 120, so no one is left to match after the first condition, and the remaining conditions are useless.
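
To compare the two orderings side by side, here is a minimal sketch on three made-up ages (the `toy_age` data frame and the column names `good_order` and `bad_order` are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: one child, one young adult, one older adult
toy_age <- tibble(age_years = c(5, 25, 70))

toy_order <-
  toy_age %>%
  mutate(
    # narrowest condition first: each row stops at its first match
    good_order = case_when(age_years < 18 ~ "Child",
                           age_years < 30 ~ "Young adult",
                           age_years < 120 ~ "Older adult"),
    # widest condition first: it captures every row
    bad_order = case_when(age_years < 120 ~ "Older adult",
                          age_years < 30 ~ "Young adult",
                          age_years < 18 ~ "Child")
  )

toy_order$good_order  # "Child" "Young adult" "Older adult"
toy_order$bad_order   # "Older adult" "Older adult" "Older adult"
```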


Although we have spent much time explaining the importance of the order of conditions,
in this specific example, there would be a much clearer way to write this code that would
not depend on the order of conditions. Rather than leave the age groups open-ended like
this:


age_years < 120 ~ "Older adult"

you should actually use closed age bounds like this:

age_years >= 30 & age_years < 120 ~ "Older adult"


which is read: “if age is greater than or equal to 30 and less than 120, input ‘Older adult’”.


With such closed conditions, the order of conditions no longer matters. You get the same 
result no matter how you arrange the conditions: 


# start with "Older adult" condition
yaounde_age %>%
  mutate(age_group = case_when(
    age_years >= 30 & age_years < 120 ~ "Older adult",
    age_years >= 18 & age_years < 30 ~ "Young adult",
    age_years >= 0 & age_years < 18 ~ "Child"))


# start with "Child" condition
yaounde_age %>%
  mutate(age_group = case_when(
    age_years >= 0 & age_years < 18 ~ "Child",
    age_years >= 18 & age_years < 30 ~ "Young adult",
    age_years >= 30 & age_years < 120 ~ "Older adult"))


Nice and clean!


So why did we spend so much time explaining the importance of condition order if you
can simply avoid open-ended categories and not have to worry about condition order?


One reason is that understanding condition order should now help you see why it is
important to put the TRUE condition as the final line in your case_when() statement. The
TRUE condition matches every row that has not yet been matched, so if you use it first in
the case_when(), it will match everyone!


The other reason is that there are certain cases where you may want to use open-ended
overlapping conditions, and so you will have to pay attention to the order of conditions.


Let’s see one such example now: identifying COVID-like symptoms. Note that this is
somewhat advanced material, likely a bit above your current needs. We are introducing it
now so you are aware and can stay vigilant with case_when() in the future.


Overlapping conditions within case_when()


We want to identify COVID-like symptoms in our data. Consider the symptom columns in
the yaounde data frame, which indicate which symptoms were experienced by
respondents over a 6-month period:


yaounde %>%
  select(starts_with("symp_"))


We would like to use this to assess whether a person may have had COVID, partly following
guidelines recommended by the WHO.


• Individuals with cough are to be classed as “possible COVID cases”.
• Individuals with anosmia/ageusia (loss of smell or loss of taste) are to be classed as
“probable COVID cases”.


Now, keeping these criteria in mind, consider an individual, let’s call her Osma, who has
cough AND anosmia/ageusia. How should we classify Osma?


She meets the criteria for “possible COVID” (because she has cough), but she also meets
the criteria for “probable COVID” (because she has anosmia/ageusia). So which group
should she be classed as, “possible COVID” or “probable COVID”? Think about it for a
minute.


Hopefully you guessed that she should be classed as a “probable COVID case”. “Probable”
is more likely than “Possible”, and the anosmia/ageusia symptom is more significant than
the cough symptom. One might say that the criterion for “probable COVID” has a higher
specificity or a higher precedence than the criterion for “possible COVID”.


Therefore, when constructing a case_when() statement, the “probable COVID” condition
should also take higher precedence: it should come first in the conditions provided to
case_when(). Let's see this now.


First we select the relevant variables, for easy illustration. We also identify and slice()
specific rows that are useful for the demonstration:


yaounde_symptoms_slice <-
  yaounde %>%
  select(symp_cough, symp_anosmia_or_ageusia) %>%
  # slice of specific rows useful for demo
  # Once you find the right code, you would remove this slice
  slice(32, 711, 625, 651)


yaounde_symptoms_slice


Now, the correct case_when() statement, which has the “Probable COVID” condition
first:


yaounde_symptoms_slice %>%
  mutate(covid_status = case_when(
    symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID",
    symp_cough == "Yes" ~ "Possible COVID"
  ))


This case_when() statement can be read in simple terms as ‘If the person has
anosmia/ageusia, input “Probable COVID”, and otherwise, if the person has cough, input
“Possible COVID”’.


Now, spend some time looking through the output data frame, especially the last three
individuals. The individual in row 2 meets the criterion for “Possible COVID” because they
have cough (symp_cough == "Yes"), and the individual in row 3 meets the criterion for
“Probable COVID” because they have anosmia/ageusia (symp_anosmia_or_ageusia ==
"Yes").


The individual in row 4 is Osma, who meets the criteria for both “possible COVID” and
“probable COVID”. And because we arranged our case_when() conditions in the right
order, she is coded correctly as “probable COVID”. Great!


But notice what happens if we swap the order of the conditions: 


yaounde_symptoms_slice %>%
  mutate(covid_status = case_when(
    symp_cough == "Yes" ~ "Possible COVID",
    symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID"
  ))


Oh no! Osma in row 4 is now misclassed as “Possible COVID” even though she has the
more significant anosmia/ageusia symptom. This is because the first condition,
symp_cough == "Yes", matched her first, and so the second condition was not able to
match her!
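
If you do not have the yaounde data at hand, the sketch below reproduces the same situation on four made-up respondents. The `toy_symptoms` data frame and the column names `probable_first` and `possible_first` are hypothetical; the last row plays the role of Osma.

```r
library(dplyr)

# Hypothetical toy data: no symptoms, cough only, anosmia only, both
toy_symptoms <- tibble(
  symp_cough = c("No", "Yes", "No", "Yes"),
  symp_anosmia_or_ageusia = c("No", "No", "Yes", "Yes")
)

toy_status <-
  toy_symptoms %>%
  mutate(
    # more significant criterion first
    probable_first = case_when(
      symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID",
      symp_cough == "Yes" ~ "Possible COVID"
    ),
    # less significant criterion first
    possible_first = case_when(
      symp_cough == "Yes" ~ "Possible COVID",
      symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID"
    )
  )

toy_status$probable_first  # NA "Possible COVID" "Probable COVID" "Probable COVID"
toy_status$possible_first  # NA "Possible COVID" "Probable COVID" "Possible COVID"
```

Only the last row differs between the two columns, which is exactly the person who meets both criteria.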


So now you see why you sometimes need to think deeply about the order of your
case_when() conditions. It is a minor point, but it can bite you at unexpected times.
Even experienced analysts tend to make mistakes that can be traced to improper
arrangement of case_when() statements.


CHALLENGE

In reality, there is still another solution to avoid misclassifying the person
with cough and anosmia/ageusia. That is to add
symp_anosmia_or_ageusia != "Yes" (not equal to “Yes”) to the
conditions for “Possible COVID”. Can you think of why this works?


yaounde_symptoms_slice %>%
  mutate(covid_status = case_when(
    symp_cough == "Yes" & symp_anosmia_or_ageusia != "Yes" ~ "Possible COVID",
    symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID"))


PRACTICE (in RMD)

With the flu_linelist dataset, create a new column called
follow_up_priority that implements the following schema:


• Women should be considered “High priority”
• All children (under 18 years) of any gender should be considered
“Highest priority”.
• Everyone else should have the value “No priority”


# Complete the code with your answer:
Q_priority_groups <-
  flu_linelist %>%
  mutate(follow_up_priority =
  )


Binary conditions: dplyr::if_else()


Fig: the if_else() conditions


There is another {dplyr} function similar to case_when() for when we want to apply a binary
condition to a variable: if_else(). A binary condition is either TRUE or FALSE.


if_else() works similarly to case_when(): if the condition is true, one value is
assigned; if the condition is false, the alternative is assigned. The syntax is:
if_else(CONDITION, IF_TRUE, IF_FALSE). As you can see, this only allows for a
binary condition (not multiple cases, such as handled by case_when()).


If we take one of the earlier examples, recoding the highest_education variable, we
can write it either with case_when() or with if_else().


Here is the version we already explored: 


yaounde_educ %>%
  mutate(highest_education =
           case_when(
             highest_education %in% c("University", "Doctorate") ~ "Post-secondary",
             TRUE ~ highest_education
           ))


And this is how we would write it using if_else():


yaounde_educ %>%
  mutate(highest_education =
           if_else(
             highest_education %in% c("University", "Doctorate"),
             # if TRUE then we recode
             "Post-secondary",
             # if FALSE then we keep default value
             highest_education
           ))


As you can see, we get the same output whether we use if_else() or case_when().
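
One detail worth knowing: when the condition itself is NA, if_else() returns NA by default. It also has an optional `missing` argument for supplying a value in that case. The sketch below illustrates both behaviours on a hypothetical `toy_age` data frame:

```r
library(dplyr)

# Hypothetical toy data: two ages and one missing value
toy_age <- tibble(age_years = c(5, 40, NA))

toy_if_else <-
  toy_age %>%
  mutate(
    # NA condition gives NA result
    age_group = if_else(age_years < 18, "Child", "Adult"),
    # NA condition gives the value passed to `missing`
    age_group_labelled = if_else(age_years < 18, "Child", "Adult",
                                 missing = "Missing age")
  )

toy_if_else$age_group           # "Child" "Adult" NA
toy_if_else$age_group_labelled  # "Child" "Adult" "Missing age"
```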


PRACTICE (in RMD)

With the flu_linelist data, make a new column, called age_group,
that has the value “Below 50” for people under 50 and “50 and above” for
people aged 50 and up. Use the if_else() function.


This is exactly the same question as your first practice question, but this
time you need to use if_else().

# Complete the code with your answer:
Q_age_group_if_else <-
  flu_linelist %>%
  mutate(age_group = if_else( ))


Wrap up 


Changing or constructing your variables based on conditions on other variables is one of
the most common data wrangling tasks, common enough to deserve its very own lesson!


We hope that you now feel comfortable using case_when() and if_else() within
mutate(), and that you are excited to learn more complex {dplyr} operations such as
grouping variables and summarizing them.


See you next time! 


Fig: the conditional mutate() options: mutate(if_else()) and mutate(case_when())


Contributors 


The following team members contributed to this lesson: 


LAURE VANCAUWENBERGHE

Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health
and education


KENE DAVID NWOSU

Data analyst, the GRAPH Network
Passionate about world improvement


References 


Some material in this lesson was adapted from the following sources: 


• Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original
work published 2020)


• Create, modify, and delete columns: Mutate. (n.d.). Retrieved 21 February 2022,
from https://dplyr.tidyverse.org/reference/mutate.html


Artwork was adapted from:


• Horst, A. (2022). R & stats illustrations by Allison Horst.
https://github.com/allisonhorst/stats-illustrations (Original work published 2018)


Lesson notes | Grouping and summarizing data




Introduction
Learning objectives
The Yaounde COVID-19 dataset
What are summary statistics?
Introducing dplyr::summarize()
Grouped summaries with dplyr::group_by()
Grouping by multiple variables (nested grouping)
Ungrouping with dplyr::ungroup() (why and how)
Counting rows
Counting rows that meet a condition
dplyr::count()
Including missing combinations in summaries
Wrap up


Introduction 


You currently know how to keep your data entries of interest, how to keep relevant variables
and how to modify them or create new ones.


Now, we will take your data wrangling skills one step further by understanding how to
easily extract summary statistics, such as the mean of a variable, through the verb
summarize().


Moreover, we will begin exploring a crucial verb, group_by(), capable of grouping your
data by one or more variables so you can perform grouped operations on your data set.


Let's go! 


Learning objectives 


1. You can use dplyr::summarize() to extract summary statistics from datasets.


2. You can use dplyr::group_by() to group data by one or more variables before
performing operations on them.


3. You understand why and how to ungroup grouped data frames.


4. You can use dplyr::n() together with group_by()-summarize() to count rows
per group.


5. You can use sum() together with group_by()-summarize() to count rows that
meet a condition.


6. You can use dplyr::count() as a handy function to count rows per group.


The Yaounde COVID-19 dataset 


In this lesson, we will again use data from the COVID-19 serological survey conducted in 
Yaounde, Cameroon. 


yaounde <- read_csv(here::here('data/yaounde_data.csv'))


# A smaller subset of variables
yao <- yaounde %>% select(
  age, age_category_3, sex, weight_kg, height_cm,
  neighborhood, is_smoker, is_pregnant, occupation,
  treatment_combinations, symptoms, n_days_miss_work, n_bedridden_days,
  highest_education, igg_result)


yao


# A tibble: 971 x 15
     age age_category_3 sex    weight_kg height_cm
   <dbl> <chr>          <chr>      <dbl>     <dbl>
 1    45 Adult          Female        95       169
 2    55 Adult          Male          96       185
 3    23 Adult          Male          74       180
 4    20 Adult          Female        70       164
 5    55 Adult          Female        67       147
 6    17 Child          Female        65       162
 7    13 Child          Female        65       150
 8    28 Adult          Male          62       173
 9    30 Adult          Male          73       170
10    13 Child          Female        56       153
# ... with 961 more rows, and 10 more variables:
#   neighborhood <chr>, is_smoker <chr>, ...


See the first lesson in this chapter for more information about this dataset. 


What are summary statistics?


A summary statistic is a single value (such as a mean or median) that describes a
sequence of values (typically a column in your dataset).


Fig: What is a summary statistic? Summary statistics describe a sequence of values (for example, an Age column) with a single value (for example, Mean: 3.375).


Summary statistics can describe the center, spread or range of a variable, or the counts 
and positions of values within that variable. Some common summary statistics are shown 
in the diagram below: 


Examples of summary statistics


age <- c(9, 1, 4, 2, 2, 2)


Summary statistic           R code                    Output

Counts
  No. of elements           length(age)               6
  No. of distinct elements  dplyr::n_distinct(age)    4

Position
  First element             dplyr::first(age)         9
  Last element              dplyr::last(age)          2
  3rd element               dplyr::nth(age, 3)        4

Center
  Mean                      mean(age)                 3.3
  Median                    median(age)               2

Spread
  Standard deviation        sd(age)                   2.9
  Interquartile range       IQR(age)                  1.5

Range
  Minimum                   min(age)                  1
  Maximum                   max(age)                  9
  25th percentile           quantile(age, 0.25)       2

Note: inside summarize(), the number of elements is obtained with dplyr::n(), which takes no arguments.
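
If you would like to reproduce these numbers yourself, the sketch below computes them in one go. The object name `age_summary` and its column names are made up for this illustration:

```r
library(dplyr)

age <- c(9, 1, 4, 2, 2, 2)

# Put the vector in a one-column data frame and summarize it
age_summary <-
  tibble(age = age) %>%
  summarize(n_elements = n(),            # n() takes no arguments
            n_unique = n_distinct(age),
            first_value = first(age),
            last_value = last(age),
            third_value = nth(age, 3),
            mean_age = mean(age),
            median_age = median(age),
            sd_age = sd(age),
            iqr_age = IQR(age),
            min_age = min(age),
            max_age = max(age),
            q25_age = quantile(age, 0.25))

age_summary
```

The mean and standard deviation come out as roughly 3.33 and 2.94, which round to the 3.3 and 2.9 shown in the table.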


Computing summary statistics is a very common operation in most data analysis
workflows, so it will be important to become fluent in extracting them from your
datasets. And for this task, there is no better tool than the {dplyr} function summarize()!
So let’s see how to use this powerful function.


Introducing dplyr::summarize()


To get started, it is best to first consider how to get simple summary statistics without
using summarize(), then we will consider why you should actually use summarize().


Imagine you were asked to find the mean age of respondents in the yao data frame. How 
might you do this in base R? 


First, recall that the dollar sign operator, $, allows you to extract a data frame column to a
vector:


yao$age # extract the `age` column from `yao`


To obtain the mean, you simply pass this yao$age vector into the mean() function:


mean(yao$age)


## [1] 29.01751 


And that’s it! You now have a simple summary statistic. Extremely easy, right? 


So why do we need summarize() to get summary statistics if the process is already so 
simple without it? We'll come back to the why question soon. First let's see how to obtain 
summary statistics with summarize(). 


Going back to the previous example, the correct syntax to get the mean age with 
summarize() would be: 


yao %>% 
  summarize(mean_age = mean(age)) 


## # A tibble: 1 x 1 
##   mean_age 
##      <dbl> 
## 1     29.0 


The anatomy of this syntax is shown below. You simply need to input the name of the new 
column (e.g. mean_age), the summary function (e.g. mean()), and the column to 
summarize (e.g. age). 


df %>% summarize(mean_age = mean(age)) 


Fig. Basic syntax for the summarize() function: mean_age is the new column name, mean() is the summary function, and age is the column to summarize. 


You can also compute multiple summary statistics in a single summarize() statement. 
For example, if you wanted both the mean and the median age, you could run: 


yao %>% 
  summarize(mean_age = mean(age), 
            median_age = median(age)) 


## # A tibble: 1 x 2 
##   mean_age median_age 
##      <dbl>      <dbl> 
## 1     29.0         26 


Nice! 


Now, you may be wondering why summarize() puts the summary statistics into a data 
frame, with each statistic in a different column. 


The main benefit of this data frame structure is that it makes it easy to produce grouped 
summaries (and creating such grouped summaries will be the primary benefit of using 
summarize()). 


We will look at these grouped summaries in the next section. For now, attempt the 
practice questions below. 
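Before you try the practice questions, here is a small sketch of the difference between the two approaches seen so far. It uses a made-up `toy` data frame (an assumption, standing in for yao) to show that summarize() returns a one-row data frame, while mean() on an extracted column returns a plain number. The pull() function from {dplyr} is used to get the number back out of the summary.

```r
library(dplyr)

# Hypothetical toy data standing in for `yao`
toy <- tibble(age = c(9, 1, 4, 2, 2, 2))

# summarize() returns a data frame with one column per statistic
age_summary <- toy %>%
  summarize(mean_age = mean(age),
            median_age = median(age))

age_summary                      # a 1-row, 2-column data frame
is.data.frame(age_summary)       # check that the result is a data frame

# pull() extracts one column as a vector, giving the same value as mean(toy$age)
age_summary %>% pull(mean_age)
mean(toy$age)
```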


PRACTICE (in RMD) 

Use summarize() and the relevant summary functions to obtain the 
mean, median and standard deviation of respondent weights from the 
weight_kg variable of the yao data frame. 

Your output should be a data frame with three columns named as shown 
below: 

mean_weight_kg  median_weight_kg  sd_weight_kg 

Q_weight_summary <- 
  yao %>% 


PRACTICE (in RMD) 

Use summarize() and the relevant summary functions to obtain the 
minimum and maximum respondent heights from the height_cm 
variable of the yao data frame. 

Your output should be a data frame with two columns named as shown 
below: 

min_height_cm  max_height_cm 

Q_height_summary <- 
  yao %>% 

.CHECK_Q_height_summary() 
.HINT_Q_height_summary() 


Grouped summaries with dplyr::group_by() 


As its name suggests, dplyr::group_by() lets you group a data frame by the values in a 
variable (e.g. male vs female sex). You can then perform operations that are split 
according to these groups. 


What effect does group_by() have on a data frame? Let's try to group the yao data 
frame by sex and observe the effect: 


yao %>% 
  group_by(sex) 


## # A tibble: 971 x 15 
## # Groups:   sex [2] 
##      age age_category_3 sex    weight_kg height_cm 
##    <dbl> <chr>          <chr>      <dbl>     <dbl> 
##  1    45 Adult          Female        95       169 
##  2    55 Adult          Male          96       185 
##  3    23 Adult          Male          74       180 
##  4    20 Adult          Female        70       164 
##  5    55 Adult          Female        67       147 
##  6    17 Child          Female        65       162 
##  7    13 Child          Female        65       150 
##  8    28 Adult          Male          62       173 
##  9    30 Adult          Male          73       170 
## 10    13 Child          Female        56       153 
## # ... with 961 more rows, and 10 more variables: 
## #   neighborhood <chr>, is_smoker <chr>, ... 


Hmm. Apparently nothing happened. The one thing you might notice is a new section in 
the header that tells you the grouped-by variable (sex) and the number of groups (2): 


## # A tibble: 971 x 15 
## # Groups:   sex [2] 


Apart from this header however, the data frame appears unchanged. 


But watch what happens when we chain the group_by() with the summarize() call we 
used in the previous section: 


yao %>% 
  group_by(sex) %>% 
  summarize(mean_age = mean(age)) 


## # A tibble: 2 x 2 
##   sex    mean_age 
##   <chr>     <dbl> 
## 1 Female     29.5 
## 2 Male       28.4 


You get a different summary statistic for each group! The statistics for women are in one 
row and those for men are in another. (From this output data frame, you can tell that, for 
example, the mean age for female respondents is 29.5, while that for male respondents is 
28.4.) 


As was mentioned earlier, this kind of grouped summary is the primary reason the 
summarize() function is so useful! 


Let's see another example of a simple group_by() + summarize() operation. 


Suppose you were asked to obtain the maximum and minimum weights for individuals in 
different neighborhoods in the yao data frame. First you would group_by() the 
neighborhood variable, then call the max() and min() functions inside summarize(): 


yao %>% 
  group_by(neighborhood) %>% 
  summarize(max_weight = max(weight_kg), 
            min_weight = min(weight_kg)) 


## # A tibble: 9 x 3 
##   neighborhood max_weight min_weight 
##   <chr>             <dbl>      <dbl> 
## 1 Briqueterie         128         20 
## 2 Carriere            129         14 
## 3 Cité Verte          118         16 
## 4 Ekoudou             135         15 
## 5 Messa                96         19 
## 6 Mokolo              162         16 
## 7 Nkomkana            161         15 
## 8 Tsinga              105         15 
## 9 Tsinga Oliga        100         17 


Great! With just a few lines of code you are able to extract quite a lot of information. 


Let's see one more example for good measure. The variable n_days_miss_work tells us 
the number of days that respondents missed work due to COVID-like symptoms. 
Individuals who reported no COVID-like symptoms have an NA for this variable: 


yao %>% 
  select(n_days_miss_work) 


## # A tibble: 971 x 1 
##    n_days_miss_work 
##               <dbl> 
## ... 
## # ... with 961 more rows 


To count the total number of work days missed for each sex group, you could try to run 
the sum() function on the n_days_miss_work variable: 


yao %>% 
  group_by(sex) %>% 
  summarise(total_days_missed = sum(n_days_miss_work)) 


## # A tibble: 2 x 2 
##   sex    total_days_missed 
##   <chr>              <dbl> 
## 1 Female                NA 
## 2 Male                  NA 


Hmmm. This gives you NA results because some rows in the n_days_miss_work column 
have NAs in them, and sum() returns NA whenever any of its input values is NA. To solve 
this, the argument na.rm = TRUE is needed: 


yao %>% 
  group_by(sex) %>% 
  summarise(total_days_missed = sum(n_days_miss_work, na.rm = TRUE)) 


## # A tibble: 2 x 2 
##   sex    total_days_missed 
##   <chr>              <dbl> 
## 1 Female               256 
## 2 Male                 272 


The output tells us that across all women in the sample, 256 work days were missed due 
to COVID-like symptoms, and across all men, 272 days. 
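The effect of na.rm = TRUE is easiest to see on a small vector. The sketch below uses made-up values (an assumption, not taken from yao) and only base R functions. Note that the same argument is accepted by other summary functions such as mean().

```r
# Hypothetical values: two respondents have a missing (NA) entry
days_missed <- c(2, NA, 5, NA, 1)

# Without na.rm, a single NA makes the whole sum NA
total_with_na <- sum(days_missed)
total_with_na

# With na.rm = TRUE, NAs are dropped before summing: 2 + 5 + 1
total_without_na <- sum(days_missed, na.rm = TRUE)
total_without_na

# The same argument works for mean(), which then averages the 3 non-missing values
mean_without_na <- mean(days_missed, na.rm = TRUE)
mean_without_na
```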


So hopefully now you see why summarize() is so powerful. In combination with 
group_by(), it lets you obtain highly informative grouped summaries of your datasets 
with very few lines of code. 


Producing such summaries is a very important part of most data analysis workflows, so 
this skill is likely to come in handy soon! 


summarize() produces “Pivot Tables” 


The summary data frames created by summarize() are often called 
Pivot Tables in the context of spreadsheet software like Microsoft Excel. 


PRACTICE (in RMD) 

Use group_by() and summarize() to obtain the mean weight (kg) by 
smoking status in the yao data frame. Name the average weight column 
weight_mean. 

The output data frame should look like this: 

is_smoker    weight_mean 
Ex-smoker 
Non-smoker 
Smoker 
NA 

Q_weight_by_smoking_status <- 
  yao %>% 


PRACTICE (in RMD) 

Use group_by(), summarize(), and the relevant summary functions to 
obtain the minimum and maximum heights for each sex in the yao data 
frame. 

Your output should be a data frame with three columns named as shown 
below: 

sex      min_height_cm  max_height_cm 
Female 
Male 

Q_min_max_height_by_sex <- 
  yao %>% 


PRACTICE (in RMD) 

Use group_by(), summarize(), and the sum() function to calculate the 
total number of bedridden days (from the n_bedridden_days variable) 
reported by respondents of each sex. 

Your output should be a data frame with two columns named as shown 
below: 

sex      total_bedridden_days 
Female 
Male 

Q_sum_bedridden_days <- 
  yao %>% 


Grouping by multiple variables (nested grouping) 


It is possible to group a data frame by more than one variable. This is sometimes called 
“nested” grouping. 


Let's see an example. Suppose you want to know the mean age of men and women in 
each neighborhood (rather than the mean age of all women). You could put both sex 
and neighborhood in the group_by() statement: 


yao %>% 
  group_by(sex, neighborhood) %>% 
  summarize(mean_age = mean(age)) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 18 x 3 
## # Groups:   sex [2] 
##    sex    neighborhood mean_age 
##    <chr>  <chr>           <dbl> 
##  1 Female Briqueterie      31.6 
##  2 Female Carriere         28.2 
##  3 Female Cité Verte       31.8 
##  4 Female Ekoudou          29.3 
##  5 Female Messa            30.2 
##  6 Female Mokolo           28.0 
##  7 Female Nkomkana         33.0 
##  8 Female Tsinga           30.6 
##  9 Female Tsinga Oliga     24.3 
## 10 Male   Briqueterie      33.7 
## 11 Male   Carriere         30.0 
## 12 Male   Cité Verte       27.0 
## 13 Male   Ekoudou          25.2 
## 14 Male   Messa            23.9 
## 15 Male   Mokolo           30.5 
## 16 Male   Nkomkana         29.8 
## 17 Male   Tsinga           28.8 
## 18 Male   Tsinga Oliga     24.3 


From this output data frame you can tell that, for example, women from Briqueterie have 
a mean age of 31.6 years, while men from Briqueterie have a mean age of 33.7 years. 


The order of the columns listed in group_by() is interchangeable. So if you run 
group_by(neighborhood, sex) instead of group_by(sex, neighborhood), you'll get 
the same result, although it will be ordered differently: 


yao %>% 
  group_by(neighborhood, sex) %>% 
  summarize(mean_age = mean(age)) 


## `summarise()` has grouped output by 'neighborhood'. You can override 
## using the `.groups` argument. 


## # A tibble: 18 x 3 
## # Groups:   neighborhood [9] 
##    neighborhood sex    mean_age 
##    <chr>        <chr>     <dbl> 
##  1 Briqueterie  Female     31.6 
##  2 Briqueterie  Male       33.7 
##  3 Carriere     Female     28.2 
##  4 Carriere     Male       30.0 
##  5 Cité Verte   Female     31.8 
##  6 Cité Verte   Male       27.0 
##  7 Ekoudou      Female     29.3 
##  8 Ekoudou      Male       25.2 
##  9 Messa        Female     30.2 
## 10 Messa        Male       23.9 
## 11 Mokolo       Female     28.0 
## 12 Mokolo       Male       30.5 
## 13 Nkomkana     Female     33.0 
## 14 Nkomkana     Male       29.8 
## 15 Tsinga       Female     30.6 
## 16 Tsinga       Male       28.8 
## 17 Tsinga Oliga Female     24.3 
## 18 Tsinga Oliga Male       24.3 


Now the column order is different: neighborhood is the first column, and sex is the 
second. And the row order is also different: rows are first ordered by neighborhood, then 
ordered by sex within each neighborhood. 


But the actual summary statistics are the same. For example, you can again see that 
women from Briqueterie have a mean age of 31.6 years, while men from Briqueterie have 
a mean age of 33.7 years. 
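You can convince yourself that only the ordering differs with a quick check on a small data set. The sketch below uses a made-up `toy` data frame (an assumption, standing in for yao). It summarizes with both grouping orders, puts the columns and rows of the second result in the same order as the first, and compares them. The .groups = "drop" argument simply returns ungrouped output, which also silences the message about grouping.

```r
library(dplyr)

# Hypothetical toy data standing in for `yao`
toy <- tibble(
  sex          = c("Female", "Female", "Male", "Male", "Female", "Male"),
  neighborhood = c("A", "B", "A", "B", "A", "A"),
  age          = c(30, 25, 40, 35, 20, 50)
)

by_sex_first <- toy %>%
  group_by(sex, neighborhood) %>%
  summarize(mean_age = mean(age), .groups = "drop")

by_neighborhood_first <- toy %>%
  group_by(neighborhood, sex) %>%
  summarize(mean_age = mean(age), .groups = "drop")

# Put the second result in the same column and row order as the first
reordered <- by_neighborhood_first %>%
  select(sex, neighborhood, mean_age) %>%
  arrange(sex, neighborhood)

# Compare the two results as plain data frames
same_result <- isTRUE(all.equal(as.data.frame(by_sex_first),
                                as.data.frame(reordered)))
same_result
```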


PRACTICE (in RMD) 

Using the yao data frame, group your data by gender (sex) and 
treatments (treatment_combinations) using group_by(). Then, using 
summarize() and the relevant summary function, calculate the mean 
weight (weight_kg) for each group. 

Your output should be a data frame with three columns named as shown 
below: 

sex  treatment_combinations  mean_weight_kg 

Q_weight_by_sex_treatments <- 
  yao %>% 


PRACTICE (in RMD) 

Using the yao data frame, group your data by age category 
(age_category_3), gender (sex), and IgG results (igg_result) using 
group_by(). Then, using summarize() and the relevant summary function, 
calculate the mean number of bedridden days (n_bedridden_days) for 
each group. 

Your output should be a data frame with four columns named as shown 
below: 

age_category_3  sex  igg_result  mean_n_bedridden_days 

Q_bedridden_by_age_sex_iggresult <- 
  yao %>% 


Ungrouping with dplyr::ungroup() (why and how) 


When you group_by() more than one variable before using summarize(), the output 
data frame is still grouped. This persistent grouping can have unwanted downstream 
effects, so you will sometimes need to use dplyr::ungroup() to ungroup the data 
before doing further analysis. 


To understand why you should ungroup() data, first consider the following example, 
where we group by only one variable before summarizing: 


yao %>% 
  group_by(sex) %>% 
  summarize(mean_age = mean(age)) 


## # A tibble: 2 x 2 
##   sex    mean_age 
##   <chr>     <dbl> 
## 1 Female     29.5 
## 2 Male       28.4 


The data comes out like a normal data frame; it is not grouped. You can tell this because 
there is no information about groups in the header. 


But now consider when you group by two variables before summarizing: 


yao %>% 
  group_by(sex, neighborhood) %>% 
  summarize(mean_age = mean(age)) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 18 x 3 
## # Groups:   sex [2] 
##    sex    neighborhood mean_age 
##    <chr>  <chr>           <dbl> 
##  1 Female Briqueterie      31.6 
##  2 Female Carriere         28.2 
##  3 Female Cité Verte       31.8 
##  4 Female Ekoudou          29.3 
##  5 Female Messa            30.2 
##  6 Female Mokolo           28.0 
##  7 Female Nkomkana         33.0 
##  8 Female Tsinga           30.6 
##  9 Female Tsinga Oliga     24.3 
## 10 Male   Briqueterie      33.7 
## 11 Male   Carriere         30.0 
## 12 Male   Cité Verte       27.0 
## 13 Male   Ekoudou          25.2 
## 14 Male   Messa            23.9 
## 15 Male   Mokolo           30.5 
## 16 Male   Nkomkana         29.8 
## 17 Male   Tsinga           28.8 
## 18 Male   Tsinga Oliga     24.3 


Now the header tells you that the data is still grouped by the first variable in group_by(), 
sex: 


## # A tibble: 18 x 3 
## # Groups:   sex [2] 


What is the implication of this persistent grouping in the data frame? It means that the 
data frame may exhibit what seems like weird behavior when you try to apply some 
{dplyr} functions on it. 


For example, if you try to select() a single variable, perhaps the mean_age variable, you 
should normally be able to just use select(mean_age): 


yao %>% 
  group_by(sex, neighborhood) %>% 
  summarize(mean_age = mean(age)) %>% 
  select(mean_age) # doesn't work as expected 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 
## Adding missing grouping variables: `sex` 


## # A tibble: 18 x 2 
## # Groups:   sex [2] 
##    sex    mean_age 
##    <chr>     <dbl> 
##  1 Female     31.6 
##  2 Female     28.2 
##  3 Female     31.8 
##  4 Female     29.3 
##  5 Female     30.2 
##  6 Female     28.0 
##  7 Female     33.0 
##  8 Female     30.6 
##  9 Female     24.3 
## 10 Male       33.7 
## 11 Male       30.0 
## 12 Male       27.0 
## 13 Male       25.2 
## 14 Male       23.9 
## 15 Male       30.5 
## 16 Male       29.8 
## 17 Male       28.8 
## 18 Male       24.3 


But as you can see, the grouped-by variable, sex, is still selected, even though we only 
asked for mean_age in the select() statement. 


This is one of the many examples of unique behaviors of grouped data frames. Other 
dplyr verbs like filter(), mutate() and arrange() also act in special ways on grouped 
data. We will address this in detail in a future lesson. 


So you now know why you should ungroup data when you no longer need it grouped. Let's 
now see how to ungroup data. It’s quite simple: just add the ungroup() function to your 
pipe chain. For example: 


yao %>% 
  group_by(sex, neighborhood) %>% 
  summarize(mean_age = mean(age)) %>% 
  ungroup() 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 18 x 3 
##    sex    neighborhood mean_age 
##    <chr>  <chr>           <dbl> 
##  1 Female Briqueterie      31.6 
##  2 Female Carriere         28.2 
##  3 Female Cité Verte       31.8 
##  4 Female Ekoudou          29.3 
##  5 Female Messa            30.2 
##  6 Female Mokolo           28.0 
##  7 Female Nkomkana         33.0 
##  8 Female Tsinga           30.6 
##  9 Female Tsinga Oliga     24.3 
## 10 Male   Briqueterie      33.7 
## 11 Male   Carriere         30.0 
## 12 Male   Cité Verte       27.0 
## 13 Male   Ekoudou          25.2 
## 14 Male   Messa            23.9 
## 15 Male   Mokolo           30.5 
## 16 Male   Nkomkana         29.8 
## 17 Male   Tsinga           28.8 
## 18 Male   Tsinga Oliga     24.3 


Now that the data frame is ungrouped, it will behave like a normal data frame again. For 
example, you can select() any column(s) you want; you won't have some unwanted 
columns tagging along: 


yao %>% 
  group_by(sex, neighborhood) %>% 
  summarize(mean_age = mean(age)) %>% 
  ungroup() %>% 
  select(mean_age) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 18 x 1 
##    mean_age 
##       <dbl> 
##  1     31.6 
##  2     28.2 
##  3     31.8 
##  4     29.3 
##  5     30.2 
##  6     28.0 
##  7     33.0 
##  8     30.6 
##  9     24.3 
## 10     33.7 
## 11     30.0 
## 12     27.0 
## 13     25.2 
## 14     23.9 
## 15     30.5 
## 16     29.8 
## 17     28.8 
## 18     24.3 
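The message printed by summarise() mentions a `.groups` argument. As an alternative to a separate ungroup() step, you can pass .groups = "drop" to summarize() so that the result comes back ungrouped. The sketch below illustrates this on a made-up `toy` data frame (an assumption, standing in for yao), and uses the {dplyr} helpers is_grouped_df() and group_vars() to inspect the grouping.

```r
library(dplyr)

# Hypothetical toy data standing in for `yao`
toy <- tibble(
  sex          = c("Female", "Female", "Male", "Male"),
  neighborhood = c("A", "B", "A", "B"),
  age          = c(30, 25, 40, 35)
)

# Default behavior: the output is still grouped by the first variable, sex
still_grouped <- toy %>%
  group_by(sex, neighborhood) %>%
  summarize(mean_age = mean(age))

group_vars(still_grouped)      # the remaining grouping variable

# With .groups = "drop", the output is a normal, ungrouped data frame
dropped <- toy %>%
  group_by(sex, neighborhood) %>%
  summarize(mean_age = mean(age), .groups = "drop")

is_grouped_df(dropped)         # check whether any grouping remains

# select() now returns only the requested column
dropped %>% select(mean_age)
```

Whether you prefer ungroup() or .groups = "drop" is mostly a matter of taste; both leave you with an ungrouped data frame.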


Counting rows 


You can do a lot of data science by just counting and occasionally dividing. - 
Hadley Wickham, Chief Scientist at RStudio 


A common data summarization task is counting how many observations (rows) there are 
for each group. You can achieve this with the special n() function from {dplyr}, which is 
specifically designed to be used within summarise(). 


For example, if you want to count how many individuals are in each neighborhood group, 
you would run: 


yao %>% 
  group_by(neighborhood) %>% 
  summarize(count = n()) 


## # A tibble: 9 x 2 
##   neighborhood count 
##   <chr>        <int> 
## 1 Briqueterie    106 
## 2 Carriere       236 
## 3 Cité Verte      72 
## 4 Ekoudou        190 
## 5 Messa           48 
## 6 Mokolo          96 
## 7 Nkomkana        75 
## 8 Tsinga          81 
## 9 Tsinga Oliga    67 


As you can see, the n() function does not require any arguments. It just “knows its job” in 
the data frame! 


Of course, you can include other summary statistics in the same summarize() call. For 
example, below we also calculate the mean age per neighborhood. 


yao %>% 
  group_by(neighborhood) %>% 
  summarize(count = n(), 
            mean_age = mean(age)) 


## # A tibble: 9 x 3 
##   neighborhood count mean_age 
##   <chr>        <int>    <dbl> 
## 1 Briqueterie    106     32.5 
## 2 Carriere       236     28.9 
## 3 Cité Verte      72     29.9 
## 4 Ekoudou        190     27.6 
## 5 Messa           48     27.3 
## 6 Mokolo          96     29.1 
## 7 Nkomkana        75     31.7 
## 8 Tsinga          81     29.7 
## 9 Tsinga Oliga    67     24.3 


PRACTICE (in RMD) 

Group your yao data frame by the respondents' occupation 
(occupation) and use summarize() to create columns that show: 

• how many individuals there are with each occupation (think of the 
  n() function) 
• the mean number of work days missed (n_days_miss_work) by 
  those in that occupation 

Your output should be a data frame with three columns named as shown 
below: 

occupation  count  mean_n_days_miss_work 

Q_occupation_summary <- 
  yao %>% 


Counting rows that meet a condition 


Rather than counting all rows as above, it is sometimes more useful to count just the rows 
that meet specific conditions. This can be done easily by placing the required conditions 
within the sum() function. 


For example, to count the number of people under 18 in each neighborhood, you place 
the condition age < 18 inside sum(): 


yao %>% 
  group_by(neighborhood) %>% 
  summarize(count_under_18 = sum(age < 18)) 


## # A tibble: 9 x 2 
##   neighborhood count_under_18 
##   <chr>                 <int> 
## 1 Briqueterie              28 
## 2 Carriere                 58 
## 3 Cité Verte               19 
## 4 Ekoudou                  66 
## 5 Messa                    18 
## 6 Mokolo                   32 
## 7 Nkomkana                 22 
## 8 Tsinga                   23 
## 9 Tsinga Oliga             25 


Similarly, to count the number of people with doctorate degrees in each neighborhood, 
you place the condition highest_education == "Doctorate" inside sum(): 


yao %>% 
  group_by(neighborhood) %>% 
  summarize(count_with_doctorates = sum(highest_education == "Doctorate")) 


## # A tibble: 9 x 2 
##   neighborhood count_with_doctorates 
##   <chr>                        <int> 
## 1 Briqueterie                      2 
## 2 Carriere                         1 
## 3 Cité Verte                       1 
## 4 Ekoudou                          1 
## 5 Messa                            2 
## 6 Mokolo                           0 
## 7 Nkomkana                         4 
## 8 Tsinga                           3 
## 9 Tsinga Oliga                     3 

CHALLENGE 

Under the hood: counting with conditions 

Why are you able to use sum(), which is meant to add numbers, on a 
condition like highest_education == "Doctorate"? 

Using sum() on a condition works because the condition evaluates to the 
Boolean values TRUE and FALSE. And these Boolean values are treated as 
numbers (where TRUE equals 1 and FALSE equals 0), and numbers can, of 
course, be summed. 

The code below demonstrates what is going on under the hood in a step-by-step 
way. Run through it and see if you can follow. 

demo_of_condition_sums <- yao %>% 
  select(highest_education) %>% 
  mutate(with_doctorate = highest_education == "Doctorate") %>% 
  mutate(numeric_with_doctorate = as.numeric(with_doctorate)) 

demo_of_condition_sums 


## # A tibble: 971 x 3 
##    highest_education with_doctorate numeric_with_doctorate 
##    <chr>             <lgl>                           <dbl> 
##  1 Secondary         FALSE                               0 
##  2 University        FALSE                               0 
##  3 University        FALSE                               0 
##  4 Secondary         FALSE                               0 
##  5 Primary           FALSE                               0 
##  6 Secondary         FALSE                               0 
##  7 Secondary         FALSE                               0 
##  8 Doctorate         TRUE                                1 
##  9 Secondary         FALSE                               0 
## 10 Secondary         FALSE                               0 
## # ... with 961 more rows 


The numeric values can then be added to produce a count of rows 
fulfilling the condition highest_education == "Doctorate": 

demo_of_condition_sums %>% 
  summarize(count_with_doctorate = sum(numeric_with_doctorate)) 


## # A tibble: 1 x 1 
##   count_with_doctorate 
##                  <dbl> 
## 1                   17 

For a final illustration of counting with conditions, consider the 
treatment_combinations variable, which lists the treatments received by people with 
COVID-like symptoms. People who received no treatments have an NA value: 


yao %>% 
  select(treatment_combinations) 


## # A tibble: 971 x 1 
##    treatment_combinations 
##    <chr> 
##  1 Paracetamol 
##  2 <NA> 
##  3 <NA> 
##  4 Antibiotics 
##  5 <NA> 
##  6 Paracetamol--Antibiotics 
##  7 Traditional meds. 
##  8 Paracetamol 
##  9 Paracetamol--Traditional meds. 
## 10 <NA> 
## # ... with 961 more rows 


If you want to count the number of people who received no treatment, you would sum up 
those who meet the is.na(treatment_combinations) condition: 


yao %>% 
  group_by(neighborhood) %>% 
  summarize(unknown_treatments = 
              sum(is.na(treatment_combinations))) 


## # A tibble: 9 x 2 
##   neighborhood unknown_treatments 
##   <chr>                     <int> 
## 1 Briqueterie                  82 
## 2 Carriere                    192 
## 3 Cité Verte                   46 
## 4 Ekoudou                     133 
## 5 Messa                        35 
## 6 Mokolo                       65 
## 7 Nkomkana                     53 
## 8 Tsinga                       56 
## 9 Tsinga Oliga                 47 


These are the people with NA values for the treatment_combinations column. 


To count the people who did receive some treatment, you can simply negate the is.na() 
function with !: 


yao %>% 
  group_by(neighborhood) %>% 
  summarize(known_treatments = sum(!is.na(treatment_combinations))) 


## # A tibble: 9 x 2 
##   neighborhood known_treatments 
##   <chr>                   <int> 
## 1 Briqueterie                24 
## 2 Carriere                   44 
## 3 Cité Verte                 26 
## 4 Ekoudou                    57 
## 5 Messa                      13 
## 6 Mokolo                     31 
## 7 Nkomkana                   22 
## 8 Tsinga                     25 
## 9 Tsinga Oliga               20 
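A useful sanity check is that the two counts should add up to the number of rows in each group. In the outputs above, for instance, Briqueterie has 82 + 24 = 106 respondents, which matches the n() count obtained for that neighborhood earlier. The sketch below shows the same check on a made-up `toy` data frame (an assumption, standing in for yao), with all three statistics computed in one summarize() call.

```r
library(dplyr)

# Hypothetical toy data standing in for `yao`
toy <- tibble(
  neighborhood = c("A", "A", "A", "B", "B"),
  treatment    = c("Paracetamol", NA, NA, "Antibiotics", NA)
)

treatment_counts <- toy %>%
  group_by(neighborhood) %>%
  summarize(n_rows             = n(),                     # all rows in the group
            unknown_treatments = sum(is.na(treatment)),   # rows with NA
            known_treatments   = sum(!is.na(treatment)))  # rows without NA

treatment_counts

# In every group, the two conditional counts add up to the total row count
all(treatment_counts$unknown_treatments + treatment_counts$known_treatments ==
      treatment_counts$n_rows)
```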


PRACTICE (in RMD) 

Group your yao data frame by the respondents' symptoms (symptoms) 
and use the sum() function to count how many adults have each 
symptom combination. 

Your output should be a data frame with two columns named as shown 
below: 

symptoms  sum_adults 

Q_symptoms_adults <- 
  yao %>% 
  group_by(GROUPED VARIABLE HERE) %>% 
  summarise(sum_adults = sum(HERE, INPUT A CONDITION TO MATCH ADULTS)) 


dplyr::count() 


The dplyr::count() function wraps grouping and counting into one friendly line of 
code to help you find counts of observations by group. 


Let's use dplyr::count() on our occupation variable: 


yao %>% 
  count(occupation) 


## # A tibble: 28 x 2 
##    occupation                              n 
##    <chr>                               <int> 
##  1 Farmer                                  5 
##  2 Farmer--Other                           1 
##  3 Home-maker                             65 
##  4 Home-maker--Farmer                      2 
##  5 Home-maker--Informal worker             3 
##  6 Home-maker--Informal worker--Farmer     1 
##  7 Home-maker--Trader                      3 
##  8 Informal worker                       189 
##  9 Informal worker--Other                  2 
## 10 Informal worker--Trader                 4 
## # ... with 18 more rows 


Note that this is the same output as: 


yao %>% 
  group_by(occupation) %>% 
  summarize(n = n()) 


## # A tibble: 28 x 2 
##    occupation                              n 
##    <chr>                               <int> 
##  1 Farmer                                  5 
##  2 Farmer--Other                           1 
##  3 Home-maker                             65 
##  4 Home-maker--Farmer                      2 
##  5 Home-maker--Informal worker             3 
##  6 Home-maker--Informal worker--Farmer     1 
##  7 Home-maker--Trader                      3 
##  8 Informal worker                       189 
##  9 Informal worker--Other                  2 
## 10 Informal worker--Trader                 4 
## # ... with 18 more rows 
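The equivalence between the two approaches can be checked directly on a small example. The sketch below uses a made-up `toy` data frame (an assumption, standing in for yao) and compares the result of count() with that of group_by() plus summarize(n = n()). It also shows count()'s sort = TRUE argument, which lists the most frequent groups first.

```r
library(dplyr)

# Hypothetical toy data standing in for `yao`
toy <- tibble(
  occupation = c("Farmer", "Trader", "Farmer", "Student", "Trader", "Farmer")
)

with_count <- toy %>%
  count(occupation)

with_summarize <- toy %>%
  group_by(occupation) %>%
  summarize(n = n())

# Compare the two results as plain data frames
same_result <- isTRUE(all.equal(as.data.frame(with_count),
                                as.data.frame(with_summarize)))
same_result

# sort = TRUE orders the output from the largest count to the smallest
sorted_counts <- toy %>%
  count(occupation, sort = TRUE)
sorted_counts
```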


You can also apply dplyr::count() in a nested fashion: 


yao %>% 
  count(sex, occupation) 


## # A tibble: 40 x 3 
##    sex    occupation                              n 
##    <chr>  <chr>                               <int> 
##  1 Female Farmer                                  3 
##  2 Female Home-maker                             65 
##  3 Female Home-maker--Farmer                      2 
##  4 Female Home-maker--Informal worker             3 
##  5 Female Home-maker--Informal worker--Farmer     1 
##  6 Female Home-maker--Trader                      3 
##  7 Female Informal worker                        77 
##  8 Female Informal worker--Trader                 1 
##  9 Female No response                             8 
## 10 Female Other                                   6 
## # ... with 30 more rows 


PRACTICE (in RMD) 

The count() verb gives you key information about your dataset in a very 
quick manner. Let's look at our IgG results stratified by age category and 
sex in one line of code. 

Using the yao data frame, count the different combinations of gender 
(sex), age categories (age_category_3) and IgG results (igg_result). 

Your output should be a data frame with four columns named as shown 
below: 

sex  age_category_3  igg_result  n 

Q_count_iggresults_stratified_by_sex_agecategories <- 
  yao %>% 


PRACTICE (in RMD) 

Using the yao data frame, count the different combinations of age 
categories (age_category_3) and number of bedridden days 
(n_bedridden_days). 

Your output should be a data frame with three columns named as shown 
below: 

age_category_3  n_bedridden_days  n 

Q_count_bedridden_age_categories <- 
  yao %>% 

The downside of count() is that it can only give you a single summary statistic in the 
data frame. When you use summarize() and n() you can include multiple summary 
statistics. For example: 


yao %>% 
  group_by(sex, neighborhood) %>% 
  summarize(count = n(), 
            median_age = median(age)) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 18 x 4 
## # Groups:   sex [2] 
##    sex    neighborhood count median_age 
##    <chr>  <chr>        <int>      <dbl> 
##  1 Female Briqueterie     61       28 
##  2 Female Carriere       140       25.5 
##  3 Female Cité Verte      44       28 
##  4 Female Ekoudou        110       26.5 
##  5 Female Messa           26       27.5 
##  6 Female Mokolo          53       23 
##  7 Female Nkomkana        43       28 
##  8 Female Tsinga          42       29 
##  9 Female Tsinga Oliga    30       23.5 
## 10 Male   Briqueterie     45       28 
## 11 Male   Carriere        96       27 
## 12 Male   Cité Verte      28       22.5 
## 13 Male   Ekoudou         80       21.5 
## 14 Male   Messa           22       24.5 
## 15 Male   Mokolo          43       32 
## 16 Male   Nkomkana        32       27 
## 17 Male   Tsinga          39       27 
## 18 Male   Tsinga Oliga    37       21 


But count() can only yield counts: 


yao %>% 
  group_by(sex, neighborhood) %>% 
  count() 


## # A tibble: 18 x 3 
## # Groups:   sex, neighborhood [18] 
##    sex    neighborhood     n 
##    <chr>  <chr>        <int> 
##  1 Female Briqueterie     61 
##  2 Female Carriere       140 
##  3 Female Cité Verte      44 
##  4 Female Ekoudou        110 
##  5 Female Messa           26 
##  6 Female Mokolo          53 
##  7 Female Nkomkana        43 
##  8 Female Tsinga          42 
##  9 Female Tsinga Oliga    30 
## 10 Male   Briqueterie     45 
## 11 Male   Carriere        96 
## 12 Male   Cité Verte      28 
## 13 Male   Ekoudou         80 
## 14 Male   Messa           22 
## 15 Male   Mokolo          43 
## 16 Male   Nkomkana        32 
## 17 Male   Tsinga          39 
## 18 Male   Tsinga Oliga    37 


Including missing combinations in summaries 


When you use group_by() and summarize() on multiple variables, you obtain a 
summary statistic for every unique combination of the grouped variables. For instance, 
consider the code and output below, which counts the number of individuals in each 
age-sex group: 


yao %>% 
  group_by(sex, age_category_3) %>% 
  summarise(number_of_individuals = n()) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 6 x 3 
## # Groups:   sex [2] 
##   sex    age_category_3 number_of_individuals 
##   <chr>  <chr>                          <int> 
## 1 Female Adult                            368 
## 2 Female Child                            155 
## 3 Female Senior                            26 
## 4 Male   Adult                            267 
## 5 Male   Child                            136 
## 6 Male   Senior                            19 


In the output data frame, there is one row for each combination of sex and age group 
(Female—Adult, Female—Child and so on). 


But what happens if one of these combinations is not present in the data? 


Let’s create an artificial example to observe this. With the code below, we artificially drop 
all male children from the yao data frame: 


yao_no_male_children <- 
  yao %>% 
  filter(!(sex == "Male" & age_category_3 == "Child")) 


Now if you run the same group_by() and summarize() call on yao_no_male_children, 
you'll notice the missing combination: 


yao_no_male_children %>% 
  group_by(sex, age_category_3) %>% 
  summarise(number_of_individuals = n()) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 5 x 3 
## # Groups:   sex [2] 
##   sex    age_category_3 number_of_individuals 
##   <chr>  <chr>                          <int> 
## 1 Female Adult                            368 
## 2 Female Child                            155 
## 3 Female Senior                            26 
## 4 Male   Adult                            267 
## 5 Male   Senior                            19 


Indeed, there is no row for male children. 


But sometimes it is useful to include such missing combinations in the output data frame, 
with an NA or 0 value for the summary statistic. 


To do this, you can run the following code instead: 


yao_no_male_children %>% 
  # convert variables to factors 
  mutate(sex = as.factor(sex), 
         age_category_3 = as.factor(age_category_3)) %>% 
  # Note the .drop = FALSE argument 
  group_by(sex, age_category_3, .drop = FALSE) %>% 
  summarise(number_of_individuals = n()) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 6 x 3 
## # Groups:   sex [2] 
##   sex    age_category_3 number_of_individuals 
##   <fct>  <fct>                          <int> 
## 1 Female Adult                            368 
## 2 Female Child                            155 
## 3 Female Senior                            26 
## 4 Male   Adult                            267 
## 5 Male   Child                              0 
## 6 Male   Senior                            19 


What does the code do? 


• First it converts the grouping variables to factors with as.factor() (inside a 
  mutate() call). 


• Then it uses the argument .drop = FALSE in the group_by() function to avoid 
  dropping the missing combinations. 


Now you have a clear 0 count for the number of male children! 
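To see the two steps in isolation, here is a minimal sketch on a made-up `toy` data frame (an assumption, standing in for yao_no_male_children) in which no male children are present. The first summary drops the missing combination; the second converts the grouping variables to factors and sets .drop = FALSE, so the missing combination appears with a count of 0. The .groups = "drop" argument is only used to return ungrouped output.

```r
library(dplyr)

# Hypothetical toy data: there is no row for male children
toy <- tibble(
  sex       = c("Female", "Female", "Female", "Male"),
  age_group = c("Adult", "Child", "Adult", "Adult")
)

# Default behavior: only combinations present in the data are returned
counts_default <- toy %>%
  group_by(sex, age_group) %>%
  summarize(count = n(), .groups = "drop")

# Factors plus .drop = FALSE: every combination of factor levels is returned
counts_all <- toy %>%
  mutate(sex = as.factor(sex),
         age_group = as.factor(age_group)) %>%
  group_by(sex, age_group, .drop = FALSE) %>%
  summarize(count = n(), .groups = "drop")

counts_default   # 3 rows
counts_all       # 4 rows, including Male-Child with a count of 0
```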


Let’s see one more example, this time without artificially modifying our data. 


The code below calculates the average age by sex and education group: 


yao %>% 
  group_by(sex, highest_education) %>% 
  summarise(mean_age = mean(age)) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 13 x 3 
## # Groups:   sex [2] 
##    sex    highest_education     mean_age 
##    <chr>  <chr>                    <dbl> 
##  1 Female Doctorate                 28 
##  2 Female No formal instruction     45.6 
##  3 Female No response               39 
##  4 Female Primary                   26.8 
##  5 Female Secondary                 28.8 
##  6 Female University                31.5 
##  7 Male   Doctorate                 42.2 
##  8 Male   No formal instruction     37.9 
##  9 Male   No response               22 
## 10 Male   Other                      5.5 
## 11 Male   Primary                   22.9 
## 12 Male   Secondary                 29.4 
## 13 Male   University                31.9 


Notice that in the output data frame, there are 7 rows for men but only 6 rows for 
women, because no woman answered “Other” to the question on highest education level. 


If you nonetheless want to include the “Female-Other” row in the output data frame, you 
would run: 


yao %>% 
  mutate(sex = as.factor(sex), 
         highest_education = as.factor(highest_education)) %>% 
  group_by(sex, highest_education, .drop = FALSE) %>% 
  summarise(mean_age = mean(age)) 


## `summarise()` has grouped output by 'sex'. You can override using the 
## `.groups` argument. 


## # A tibble: 14 x 3 
## # Groups:   sex [2] 
##    sex    highest_education     mean_age 
##    <fct>  <fct>                    <dbl> 
##  1 Female Doctorate                 28 
##  2 Female No formal instruction     45.6 
##  3 Female No response               35 
##  4 Female Other                    NaN 
##  5 Female Primary                   26.8 
##  6 Female Secondary                 28.8 
##  7 Female University                31.5 
##  8 Male   Doctorate                 42.2 
##  9 Male   No formal instruction     37.9 
## 10 Male   No response               22 
## 11 Male   Other                      5.5 
## 12 Male   Primary                   22.9 
## 13 Male   Secondary                 29.4 
## 14 Male   University                31.9 
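Notice that the added row shows NaN ("not a number") rather than 0. This is because the mean of an empty set of values is undefined in R, whereas n() simply counts zero rows. The sketch below illustrates this, first on an empty vector and then on a made-up `toy` data frame (an assumption, standing in for yao).

```r
library(dplyr)

# The mean of zero values is NaN, while their count is 0
mean(numeric(0))
length(numeric(0))

# Hypothetical toy data: no female respondent has "Other" education
toy <- tibble(
  sex       = factor(c("Female", "Male", "Male")),
  education = factor(c("Primary", "Primary", "Other")),
  age       = c(30, 20, 10)
)

toy_summary <- toy %>%
  group_by(sex, education, .drop = FALSE) %>%
  summarize(count = n(),
            mean_age = mean(age),
            .groups = "drop")

# The Female-Other row has a count of 0 and a mean_age of NaN
toy_summary
```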


PRACTICE (in RMD) 

Using the yao data frame, let's calculate the median age when grouping 
by neighborhood, age_category_3, and gender (sex). 

Note, we want all possible combinations of these three variables (not just 
those present in our data). 

Pay attention to two data wrangling imperatives! 

• convert your grouping variables to factors beforehand using 
  mutate() 
• calculate your statistic, the median, while removing any NA values. 

Your output should be a data frame with four columns named as shown 
below: 

neighborhood  age_category_3  sex  median_age 

Q_median_age_by_neighborhood_agecategory_sex <- 
  yao %>% 

ae a re ee es | 
Why include missing combinations? 


Above, we mentioned that including missing combinations is often useful 
in the data analysis workflow. Let’s see one use case: plotting with 
{ggplot}. If you have not yet learned {ggplot}, that is okay, just focus on 
the plot outputs. 


To make a dodged bar chart with the age-sex counts of
yao_no_male_children, you could run:


yao_no_male_children %>%
  group_by(sex, age_category_3) %>%
  summarise(number_of_individuals = n()) %>%
  ungroup() %>%
  # pass the output to ggplot
  ggplot() +
  geom_col(aes(x = sex, y = number_of_individuals, fill = age_category_3),
           position = "dodge")


## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.


[Figure: dodged bar chart of number_of_individuals (y-axis) by sex (x-axis: Female, Male), with bars filled by age_category_3 (Adult, Child, Senior).]


Not very elegant! Ideally there should be an empty space indicating 0 for
the number of male children.


If you instead implement the procedure to include missing combinations, 
you get a more natural dodged bar plot, with an empty space for male 
children: 


yao_no_male_children %>%
  mutate(sex = as.factor(sex),
         age_category_3 = as.factor(age_category_3)) %>%
  group_by(sex, age_category_3, .drop = FALSE) %>%
  summarise(number_of_individuals = n()) %>%
  ungroup() %>%
  # pass the output to ggplot
  ggplot() +
  geom_col(aes(x = sex, y = number_of_individuals, fill = age_category_3),
           position = "dodge")


## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.


[Figure: the same dodged bar chart of number_of_individuals by sex, filled by age_category_3 (Adult, Child, Senior), now with an empty space for male children.]


Much better!


By the way, this output can be improved slightly by setting the factor 
levels for age to their proper ascending order: first “Child”, then “Adult” 
then “Senior”: 


yao_no_male_children %>%
  mutate(sex = as.factor(sex),
         age_category_3 = factor(age_category_3,
                                 levels = c("Child",
                                            "Adult",
                                            "Senior"))) %>%
  group_by(sex, age_category_3, .drop = FALSE) %>%
  summarise(number_of_individuals = n()) %>%
  ungroup() %>%
  # pass the output to ggplot
  ggplot() +
  geom_col(aes(x = sex, y = number_of_individuals, fill = age_category_3),
           position = "dodge")


## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.


[Figure: the same dodged bar chart of number_of_individuals by sex, with the age_category_3 legend ordered Child, Adult, Senior.]


Wrap-up 


You have now seen how to obtain quick summary statistics from your data, either for
exploratory data analysis or for further data presentation or plotting.


Additionally, you have discovered one of the marvels of {dplyr}, the possibility to group
your data using group_by().


group_by() combined with summarize() is one of the most common grouping
manipulations.


Fig: summarize() and group_by()


However, you can also combine group_by() with many of the other {dplyr} verbs: this is
what we will cover in our next lesson. See you soon!


Contributors 


The following team members contributed to this lesson: 


LAURE VANCAUWENBERGHE


Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health
and education


ANDREE VALLE CAMPOS


R Developer and Instructor, the GRAPH Network
Motivated by reproducible science and education


KENE DAVID NWOSU


Data analyst, the GRAPH Network
Passionate about world improvement


Thank you to Alice Osmaston and Saifeldin Shehata for their comments and review. 




References 


Some material in this lesson was adapted from the following sources: 


• Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original
work published 2020)


• Group by one or more variables. (n.d.). Retrieved 21 February 2022, from
https://dplyr.tidyverse.org/reference/group_by.html


• Summarise each group to fewer rows. (n.d.). Retrieved 21 February 2022, from
https://dplyr.tidyverse.org/reference/summarize.html


• The Carpentries. (n.d.). Grouped operations using dplyr. Grouped operations using
‘dplyr’ - Introduction to R/tidyverse for Exploratory Data Analysis. Retrieved July 28,
2022, from https://tavareshugo.github.io/r-intro-tidyverse-gapminder/06-grouped_operations_dplyr/index.html


Artwork was adapted from:


• Horst, A. (2022). R & stats illustrations by Allison Horst.
https://github.com/allisonhorst/stats-illustrations (Original work published 2018)


Lesson notes | Grouped filter, mutate and arrange


Created by the GRAPH Courses team 


January 2023 




Introduction 


Data wrangling often involves applying the same operations separately to different
groups within the data. This pattern, sometimes called “split-apply-combine”, is easily
accomplished in {dplyr} by chaining the group_by() verb with other wrangling verbs like
filter(), mutate(), and arrange() (all of which you have seen before!).


In this lesson, you'll become confident with these kinds of grouped manipulations. 


Let's get started. 




Learning objectives 


1. You can use group_by() with arrange(), filter(), and mutate() to conduct
grouped operations on a data frame.


Packages


This lesson will require the {tidyverse} suite of packages and the {here} package:


if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, here)


Datasets 


In this lesson, we will again use data from the COVID-19 serological survey conducted in
Yaounde, Cameroon. Below, we import the data, create a small data frame subset, yao,
and an even smaller subset, yao_sex_weight.


yao <-
  read_csv(here::here('data/yaounde_data.csv')) %>%
  select(sex, age, age_category, weight_kg, occupation, igg_result,
         igm_result)


yao


# A tibble: 5 x 7
  sex      age age_category weight_kg occupation
  <chr>  <dbl> <chr>            <dbl> <chr>
1 Female    45 45 - 64             95 Informal worker
2 Male      55 45 - 64             96 Salaried worker
3 Male      23 15 - 29             74 Student
4 Female    20 15 - 29             70 Student
5 Female    55 45 - 64             67 Trader--Farmer
# … with 2 more variables: igg_result <chr>,
#   igm_result <chr>


yao_sex_weight <-
  yao %>%
  select(sex, weight_kg)


yao_sex_weight


# A tibble: 5 x 2
  sex    weight_kg
  <chr>      <dbl>
1 Female        95
2 Male          96
3 Male          74
4 Female        70
5 Female        67


For practice questions, we will also use the sarcopenia data set that you have seen
previously:


sarcopenia <- read_csv(here::here('data/sarcopenia_elderly.csv'))


sarcopenia


# A tibble: 5 x 9
  number   age age_group sex_male_1_female_0 marital_status
   <dbl> <dbl> <chr>                   <dbl> <chr>
1      7  60.8 Sixties                     … married
2      8  72.3 Seventies                   … married
3      9  62.6 Sixties                     … married
4     12  72   Seventies                   … widow
5     13  60.1 Sixties                     … married
# … with 4 more variables: height_meters <dbl>,
#   weight_kg <dbl>, grip_strength_kg <dbl>, …


Arranging by group 


The arrange() function orders the rows of a data frame by the values of selected
columns. This function is only sensitive to groupings when we set its argument
.by_group to TRUE. To illustrate this, consider the yao_sex_weight data frame:


yao_sex_weight


# A tibble: 5 x 2
  sex    weight_kg
  <chr>      <dbl>
1 Female        95
2 Male          96
3 Male          74
4 Female        70
5 Female        67


We can arrange this data frame by weight like so:


yao_sex_weight %>%
  arrange(weight_kg)


# A tibble: 5 x 2
  sex    weight_kg
  <chr>      <dbl>
1 Female        14
2 Male          15
3 Male          15
4 Male          15
5 Female        15


As expected, lower weights have been brought to the top of the data frame.


If we first group the data, we might expect a different output:


yao_sex_weight %>%
  group_by(sex) %>%
  arrange(weight_kg)


# A tibble: 5 x 2
# Groups:   sex [2]
  sex    weight_kg
  <chr>      <dbl>
1 Female        14
2 Male          15
3 Male          15
4 Male          15
5 Female        15


But as you see, the arrangement is still the same. 


Only when we set the .by_group argument to TRUE do we get something different:


yao_sex_weight %>%
  group_by(sex) %>%
  arrange(weight_kg, .by_group = TRUE)


# A tibble: 5 x 2
# Groups:   sex [1]
  sex    weight_kg
  <chr>      <dbl>
1 Female        14
2 Female        15
3 Female        16
4 Female        16
5 Female        18


Now, the data is first sorted by sex (all women first), and then by weight.
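
The difference between the two calls can be checked on a four-row example. This is only a sketch: the `toy` data frame below is hypothetical and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data, not taken from the survey
toy <- tibble(
  sex       = c("Male", "Female", "Male", "Female"),
  weight_kg = c(80, 70, 60, 90)
)

# Grouping alone does not change how arrange() sorts
by_weight <- toy %>%
  group_by(sex) %>%
  arrange(weight_kg)

# .by_group = TRUE sorts by the grouping variable first, then by weight
by_group <- toy %>%
  group_by(sex) %>%
  arrange(weight_kg, .by_group = TRUE)

by_weight$weight_kg  # 60 70 80 90
by_group$weight_kg   # 70 90 60 80
```

In `by_group`, both women come first (70, 90), followed by both men (60, 80).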


arrange() can group automatically


In reality we do not need group_by() to arrange by group; we can simply put multiple
variables in the arrange() function for the same effect.


So this simple arrange() statement:


yao_sex_weight %>%
  arrange(sex, weight_kg)


## # A tibble: 5 x 2
##   sex    weight_kg
##   <chr>      <dbl>
## 1 Female        14
## 2 Female        15
## 3 Female        16
## 4 Female        16
## 5 Female        18


is equivalent to the more complex group_by(), arrange() statement used before:


yao_sex_weight %>%
  group_by(sex) %>%
  arrange(weight_kg, .by_group = TRUE)


The code arrange(sex, weight_kg) tells R to arrange the rows first by sex, and then by
weight.


This syntax, with just arrange() and no group_by(), is simpler, so you can
stick to it.


desc() for descending order


Recall that to arrange in descending order, we can wrap the target variable in desc(). So,
for example, to sort by sex and weight, but with the heaviest people on top, we can run:


yao_sex_weight %>%
  arrange(sex, desc(weight_kg))


# A tibble: 5 x 2
  sex    weight_kg
  <chr>      <dbl>
1 Female       162
2 Female       161
3 Female       158
4 Female       135
5 Female       129
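
As a quick check of how desc() combines with a second sorting variable, here is a minimal sketch on a hypothetical four-row data frame (`toy` and its values are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data, not taken from the survey
toy <- tibble(
  sex       = c("Male", "Female", "Male", "Female"),
  weight_kg = c(80, 70, 60, 90)
)

# Sex in ascending order (Female before Male),
# weight in descending order within each sex
sorted <- toy %>%
  arrange(sex, desc(weight_kg))

sorted$sex        # "Female" "Female" "Male" "Male"
sorted$weight_kg  # 90 70 80 60
```

Only the variable wrapped in desc() is reversed; sex is still sorted in its usual ascending order.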


PRACTICE (in RMD)


With an arrange() call, sort the sarcopenia data first by sex and then
by grip strength. (If done correctly, the first row should be of a woman
with a grip strength of 1.3 kg). To make the arrangement clear, you
should first select() the sex and grip strength variables.


# Complete the code with your answer:
Q_grip_strength_arranged <-
  sarcopenia %>%
  select() %>%
  arrange()


PRACTICE (in RMD)


The sarcopenia dataset contains a column, age_group, which stores
age groups as a string (the age groups are “Sixties”, “Seventies” and
“Eighties”). Convert this variable to a factor with the levels in the right
order (first “Sixties” then “Seventies” and so on). (Hint: Look back on the
case_when() lesson if you do not see how to relevel a factor.)


Then, with a nested arrange() call, arrange the data first by the newly
created age_group factor variable (younger individuals first) and then by
height_meters, with shorter individuals first.


# Complete the code with your answer:
Q_age_group_height <-
  sarcopenia


Filtering by group 


The filter() function keeps or drops rows based on a condition. If filter() is applied
to grouped data, the filtering operation is carried out separately for each group.


To illustrate this, consider again the yao_sex_weight data frame:


yao_sex_weight


## # A tibble: 5 x 2
##   sex    weight_kg
##   <chr>      <dbl>
## 1 Female        95
## 2 Male          96
## 3 Male          74
## 4 Female        70
## 5 Female        67


If we want to filter the data for the heaviest person, we could run:


yao_sex_weight %>%
  filter(weight_kg == max(weight_kg))


## # A tibble: 1 x 2
##   sex    weight_kg
##   <chr>      <dbl>
## 1 Female       162


But if we want to get the heaviest person per sex group (the heaviest man and the heaviest
woman), we can use group_by(sex) then filter():


yao_sex_weight %>%
  group_by(sex) %>%
  filter(weight_kg == max(weight_kg))


## # A tibble: 2 x 2
## # Groups:   sex [2]
##   sex    weight_kg
##   <chr>      <dbl>
## 1 Male         128
## 2 Female       162


Great! The code above can be translated as “For each sex group, keep the row with the
maximum weight_kg value”.
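
One detail worth knowing is that filtering on == max() keeps every row tied for the maximum. The sketch below illustrates this on a hypothetical data frame (`toy` is invented for illustration) and shows slice_max(), a {dplyr} alternative that is not used elsewhere in this lesson, for keeping exactly one row per group.

```r
library(dplyr)

# Hypothetical toy data: two men are tied for the heaviest weight
toy <- tibble(
  sex       = c("Female", "Female", "Male", "Male", "Male"),
  weight_kg = c(70, 90, 85, 85, 60)
)

# Grouped filter: keeps all rows tied for the maximum in each group
heaviest <- toy %>%
  group_by(sex) %>%
  filter(weight_kg == max(weight_kg)) %>%
  ungroup()

# slice_max() with with_ties = FALSE keeps exactly one row per group
one_each <- toy %>%
  group_by(sex) %>%
  slice_max(weight_kg, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(heaviest)  # 3: one woman (90) and two tied men (85, 85)
nrow(one_each)  # 2: one row per sex
```

Which behaviour is preferable depends on the analysis: ties are real data, so dropping them silently should be a deliberate choice.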


Filtering with nested groupings 


filter() will work fine with any number of nested groupings.


For example, if we want to see the heaviest man and heaviest woman per age group we
could run the following on the yao data frame:


yao %>%
  group_by(sex, age_category) %>%
  filter(weight_kg == max(weight_kg))


This code groups by sex and age category, and then finds the heaviest person in each
sub-category.


(Why do we have 10 rows in the output? Well, 2 sex groups x 5 age groups = 10
unique groupings.)
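
The group arithmetic can be verified with n_groups(). This is a minimal sketch on hypothetical data (`toy` is invented for illustration, with only two age categories to keep it short).

```r
library(dplyr)

# Hypothetical toy data with 2 sexes and 2 age categories
toy <- tibble(
  sex          = c("Female", "Female", "Male", "Male", "Male"),
  age_category = c("15 - 29", "30 - 44", "15 - 29", "15 - 29", "30 - 44"),
  weight_kg    = c(60, 72, 80, 75, 90)
)

grouped <- toy %>%
  group_by(sex, age_category)

n_groups(grouped)  # 4: 2 sexes x 2 age categories, all present here

heaviest <- grouped %>%
  filter(weight_kg == max(weight_kg))

nrow(heaviest)     # 4: one heaviest person per group (no ties here)
```

The number of output rows equals the number of groups only when every group has a single maximum; ties would add rows.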


The output is a bit scattered though, so we can chain this with the arrange() function,
to arrange by sex and age group.


yao %>%
  group_by(sex, age_category) %>%
  filter(weight_kg == max(weight_kg)) %>%
  arrange(sex, age_category)


Now the data is easier to read. All women come first, then men. But we notice a weird
arrangement of the age groups! Those aged 5 to 14 should come first in the
arrangement. Of course, we've learned how to fix this: the factor() function, and its
levels argument:


yao %>%
  mutate(age_category = factor(
    age_category,
    levels = c("5 - 14", "15 - 29", "30 - 44", "45 - 64", "65 +")
  )) %>%
  group_by(sex, age_category) %>%
  filter(weight_kg == max(weight_kg)) %>%
  arrange(sex, age_category)


Now we have a nice and well-arranged output!


PRACTICE (in RMD)


Group the sarcopenia data frame by age group and sex, then filter for
the highest skeletal muscle index in each (nested) group.


# Complete the code with your answer:
Q_max_skeletal_muscle_index <-
  sarcopenia


Mutating by group 


mutate() is used to modify columns or to create new ones. With grouped data,
mutate() operates over each group independently.


Let's first consider a regular mutate() call, not a grouped one. Imagine that you wanted
to add a column that ranks respondents by weight. This can be done with the rank()
function inside a mutate() call:


yao_sex_weight %>%
  mutate(weight_rank = rank(weight_kg))


# A tibble: 5 x 3
  sex    weight_kg weight_rank
  <chr>      <dbl>       <dbl>
1 Female        95        901
2 Male          96        908
3 Male          74        640.
4 Female        70        564.
5 Female        67        502.


The output shows that the first row is the 901st lightest individual. But it would be more
intuitive to rank in descending order with the heaviest person first. We can do this with
the desc() function:


yao_sex_weight %>%
  mutate(weight_rank = rank(desc(weight_kg)))


# A tibble: 5 x 3
  sex    weight_kg weight_rank
  <chr>      <dbl>       <dbl>
1 Female        95         71
2 Male          96         64
3 Male          74        332.
4 Female        70        408.
5 Female        67        470.


The output shows that the person in the first row is the 71st heaviest individual.


Now, let’s try to write a grouped mutate() call. Imagine we want to add this weight rank
column per sex group in the data frame. That is, we want to know each person's weight
rank in their sex category. In this case, we can chain group_by(sex) with mutate():


yao_sex_weight %>%
  group_by(sex) %>%
  mutate(weight_rank = rank(desc(weight_kg)))


# A tibble: 5 x 3
# Groups:   sex [2]
  sex    weight_kg weight_rank
  <chr>      <dbl>       <dbl>
1 Female        95        53.5
2 Male          96        13.5
3 Male          74       148
4 Female        70       220.
5 Female        67       250.


Now we see that the person in the first row is the 53rd heaviest woman. (The .5 indicates
that this rank is a tie with someone else in the data.)
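
To see where fractional ranks come from, consider a small vector with one tie. The sketch below uses invented weights; min_rank() is a {dplyr} alternative, not used elsewhere in this lesson, that gives tied values the smallest rank instead of the average.

```r
library(dplyr)

# Hypothetical weights: the two 80s are tied for heaviest
weights <- c(70, 80, 80, 60)

# Base rank() averages the ranks of tied values: (1 + 2) / 2 = 1.5
avg_ranks <- rank(desc(weights))
avg_ranks  # 3.0 1.5 1.5 4.0

# min_rank() gives tied values the smallest of their ranks
min_ranks <- min_rank(desc(weights))
min_ranks  # 3 1 1 4
```

With the default ties method, a rank ending in .5 means an even number of individuals share that weight.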


We could also arrange the data to make things clearer:


yao_sex_weight %>%
  group_by(sex) %>%
  mutate(weight_rank = rank(desc(weight_kg))) %>%
  arrange(sex, weight_rank)


## # A tibble: 5 x 3
## # Groups:   sex [1]
##   sex    weight_kg weight_rank
##   <chr>      <dbl>       <dbl>
## 1 Female       162           1
## 2 Female       161           2
## 3 Female       158           3
## 4 Female       135           4
## 5 Female       129           5


Mutating with nested groupings 


Of course, as with the other verbs we have seen, mutate() also works with nested
groups.


For example, below we create the nested grouping of age and sex with the yao data
frame, then add a rank column with mutate():


yao %>%
  group_by(sex, age_category) %>%
  mutate(weight_rank = rank(desc(weight_kg)))


# A tibble: 5 x 8
# Groups:   sex, age_category [4]
  sex      age age_category weight_kg occupation
  <chr>  <dbl> <chr>            <dbl> <chr>
1 Female    45 45 - 64             95 Informal worker
2 Male      55 45 - 64             96 Salaried worker
3 Male      23 15 - 29             74 Student
4 Female    20 15 - 29             70 Student
5 Female    55 45 - 64             67 Trader--Farmer
# … with 3 more variables: igg_result <chr>,
#   igm_result <chr>, weight_rank <dbl>


The output shows that the person in the first row is the 20th heaviest woman in the 45 to 64
age group.
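
Because the rank column is hidden in the truncated output above, a small example makes the per-group ranking visible. This is a sketch on hypothetical data (`toy` and its values are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data with 2 sexes and 2 age categories
toy <- tibble(
  sex          = c("Female", "Female", "Male", "Male", "Male"),
  age_category = c("15 - 29", "30 - 44", "15 - 29", "15 - 29", "30 - 44"),
  weight_kg    = c(60, 72, 80, 75, 90)
)

# Ranks restart at 1 within every sex x age_category combination
ranked <- toy %>%
  group_by(sex, age_category) %>%
  mutate(weight_rank = rank(desc(weight_kg))) %>%
  ungroup()

ranked$weight_rank  # 1 1 1 2 1
```

Only the two men aged 15 - 29 share a group, so they are the only rows ranked against each other (80 kg is rank 1, 75 kg is rank 2).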


PRACTICE (in RMD)


With the sarcopenia data, group by age_group, then in a new variable
called grip_strength_rank, compute the per-age-group rank of each
individual's grip strength. (To compute the rank, use mutate() and the
rank() function with its default ties method.)


# Complete the code with your answer:
Q_rank_grip_strength <-
  sarcopenia


WATCH OUT


Remember to ungroup data before further analysis


As has been mentioned before, it is important to ungroup your data before
doing further analysis.


Consider this last example, where we computed the weight rank of
individuals per age and sex group:


yao %>%
  group_by(sex, age_category) %>%
  mutate(weight_rank = rank(desc(weight_kg)))


## # A tibble: 5 x 8
## # Groups:   sex, age_category [4]
##   sex      age age_category weight_kg occupation
##   <chr>  <dbl> <chr>            <dbl> <chr>
## 1 Female    45 45 - 64             95 Informal worker
## 2 Male      55 45 - 64             96 Salaried worker
## 3 Male      23 15 - 29             74 Student
## 4 Female    20 15 - 29             70 Student
## 5 Female    55 45 - 64             67 Trader--Farmer
## # … with 3 more variables: igg_result <chr>,
## #   igm_result <chr>, weight_rank <dbl>


If, in the process of analysis, you stored this output as a new data frame:


yao_modified <-
  yao %>%
  group_by(sex, age_category) %>%
  mutate(weight_rank = rank(desc(weight_kg)))


And then, later on, you picked up the data frame and tried some other
analysis, for example, filtering to get the oldest person in the data:


yao_modified %>%
  filter(age == max(age))


## # A tibble: 5 x 8
## # Groups:   sex, age_category [10]
##   sex      age age_category weight_kg occupation
##   <chr>  <dbl> <chr>            <dbl> <chr>
## 1 Male      65 45 - 64             93 Retired
## 2 Male       … 65 +                95 Retired--Informal wor…
## 3 Male       … 5 - 14              44 Student
## 4 Female    44 30 - 44             67 Home-maker
## 5 Female     … 65 +                40 Retired
## # … with 3 more variables: igg_result <chr>,
## #   igm_result <chr>, weight_rank <dbl>


You might be confused by the output! Why are there 55 rows of “oldest
people”?


This would be because you forgot to ungroup the data before storing it
for further analysis. Let’s do this properly now:


yao_modified <-
  yao %>%
  group_by(sex, age_category) %>%
  mutate(weight_rank = rank(desc(weight_kg))) %>%
  ungroup()


Now we can correctly obtain the oldest person/people in the data set:


yao_modified %>%
  filter(age == max(age))


## # A tibble: 2 x 8
##   sex      age age_category weight_kg occupation igg_result
##   <chr>  <dbl> <chr>            <dbl> <chr>      <chr>
## 1 Female     … 65 +                40 Retired    Negative
## 2 Female     … 65 +                81 Home-maker Negative
## # … with 2 more variables: igm_result <chr>,
## #   weight_rank <dbl>


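
The pitfall can be reproduced in a few lines. The sketch below uses a hypothetical data frame (`toy` is invented for illustration); is_grouped_df() is a {dplyr} helper for checking whether a data frame still carries groups.

```r
library(dplyr)

# Hypothetical toy data, not taken from the survey
toy <- tibble(
  sex = c("Female", "Female", "Male", "Male"),
  age = c(30, 70, 45, 60)
)

# A grouped mutate() leaves the grouping attached to the result
grouped <- toy %>%
  group_by(sex) %>%
  mutate(age_rank = rank(desc(age)))

is_grouped_df(grouped)  # TRUE

# Still grouped: max(age) is computed per sex, so one row per sex comes back
oldest_per_group <- grouped %>%
  filter(age == max(age))

# Ungrouped: max(age) is computed over the whole data frame
oldest_overall <- grouped %>%
  ungroup() %>%
  filter(age == max(age))

nrow(oldest_per_group)  # 2
nrow(oldest_overall)    # 1
```

Checking is_grouped_df() (or looking for the "# Groups:" line in the printed tibble) before reusing a stored data frame is a cheap way to catch this.

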
Wrap up 


group_by() is a marvelous tool for arranging, mutating, and filtering based on the groups
within one or several variables.


[Illustrations: group_by() followed by arrange(), group_by() followed by mutate(), and group_by() followed by filter().]

Fig: filter() and group_by() 


There are numerous ways of combining these verbs to manipulate your data. We invite 
you to take some time and to try these verbs out in different combinations! 


See you next time! 


Contributors 


The following team members contributed to this lesson: 


LAURE VANCAUWENBERGHE


Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health
and education


KENE DAVID NWOSU


Data analyst, the GRAPH Network
Passionate about world improvement


References 


Some material in this lesson was adapted from the following sources: 


• Horst, A. (2022). Dplyr-learnr. https://github.com/allisonhorst/dplyr-learnr (Original
work published 2020)


• Group by one or more variables. (n.d.). Retrieved 21 February 2022, from
https://dplyr.tidyverse.org/reference/group_by.html


• Create, modify, and delete columns — Mutate. (n.d.). Retrieved 21 February 2022,
from https://dplyr.tidyverse.org/reference/mutate.html


• Subset rows using column values — Filter. (n.d.). Retrieved 21 February 2022, from
https://dplyr.tidyverse.org/reference/filter.html


• Arrange rows by column values — Arrange. (n.d.). Retrieved 21 February 2022, from
https://dplyr.tidyverse.org/reference/arrange.html


Artwork was adapted from:


• Horst, A. (2022). R & stats illustrations by Allison Horst.
https://github.com/allisonhorst/stats-illustrations (Original work published 2018)


Lesson notes | The across function 


Created by the GRAPH Courses team 


January 2023 




Intro 


In previous lessons, you learned how to perform a range of wrangling operations like
filtering, mutating and summarizing. But so far, you only performed these operations one
column at a time. Sometimes, however, it will be useful (and efficient) to apply the same
operation to several columns at the same time. For this, the across() function can be
used.


Let’s see how! 


Fig: the across() verb, used with summarize() and mutate().


Learning objectives 


1. You can use across() with the mutate() and summarize() verbs to apply
operations over multiple columns.


2. You can use the .names argument within mutate(across()) to create new
columns.


3. You can write anonymous (lambda) functions within across().


Packages


This lesson will require the packages loaded below:


if(!require(pacman)) install.packages("pacman")
pacman::p_load(here, tidyverse)


Datasets 


In this lesson, we will again use data from the COVID-19 serological survey conducted in 
Yaounde, Cameroon. 


yaounde <- read_csv(here("data/yaounde_data.csv"))


yaounde <- yaounde %>% rename(age_years = age)


yaounde


# A tibble: 5 x 53
  id                   date_surveyed age_years age_category
  <chr>                <date>            <dbl> <chr>
1 BRIQUETERIE_000_0001 2020-10-22           45 45 - 64
2 BRIQUETERIE_000_0002 2020-10-24           55 45 - 64
3 BRIQUETERIE_000_0003 2020-10-24           23 15 - 29
4 BRIQUETERIE_002_0001 2020-10-22           20 15 - 29
5 BRIQUETERIE_002_0002 2020-10-22           55 45 - 64
# … with 49 more variables: age_category_3 <chr>,
#   sex <chr>, highest_education <chr>, occupation <chr>, …

We will also use data from a hospital study conducted in Burkina Faso, in which a range of 
clinical data was collected from patients with febrile (fever-causing) diseases, with the 
aim of predicting the cause of the fever. 


febrile_diseases <- read_csv(here("data/febrile_diseases_burkina_faso.csv"))
febrile_diseases


# A tibble: 5 x 36
  age_category    sex    pretreatment_ma…¹ pretreatment_an…²
  <chr>           <chr>  <chr>             <chr>
1 5 years or old… 2      do not know       currently taking
2 5 years or old… 2      no                no
3 5 years or old… female currently taking  no
4 5 years or old… female no                currently taking
5 5 years or old… female no                no
# … with 32 more variables: onset_fever <dbl>,
#   abd_pain <chr>, diarrhoea <chr>, runny_nose <chr>, …


Finally, we will use data from a dietary diversity survey conducted in Vietnam, in which
women were asked to recall (on one or several days) the foods and drinks they consumed the
previous day.


diet <- read_csv(here("data/vietnam_diet_diversity.csv"))
diet <- diet %>% rename(household_id = hhid)


diet


# A tibble: 5 x 45
  household_id date_of_visit       age_y age_group
         <dbl> <dttm>              <dbl> <chr>
1          278 2017-05-23 00:00:00    47 40-49
2          348 2017-06-17 00:00:00    34 30-39
3          354 2017-06-17 00:00:00    37 30-39
4          324 2017-06-17 00:00:00    35 30-39
5          209 2017-06-07 00:00:00    35 30-39
# … with 41 more variables: kilocalories_consumed <dbl>,
#   water_consumed_grams <dbl>, …


Using across() with mutate()


The mutate() function gives you an easy way to create new variables or modify existing
variables in place.


But sometimes you have a large number of columns to operate on, and typing out
mutate() statements line-by-line can become onerous. In such cases across() can
radically simplify and shorten your code.


Let’s see an example.


Consider the symptoms columns (from symp_fever to symp_stomach_ache) in the
yaounde data frame:


yao_symptoms <-
  yaounde %>%
  select(age_years, sex, date_surveyed, symp_fever:symp_stomach_ache)


yao_symptoms


# A tibble: 5 x 16
  age_years sex    date_surveyed symp_fever symp_headache
      <dbl> <chr>  <date>        <chr>      <chr>
1        45 Female 2020-10-22    No         No
2        55 Male   2020-10-24    No         No
3        23 Male   2020-10-24    No         No
4        20 Female 2020-10-22    No         No
5        55 Female 2020-10-22    No         No
# … with 11 more variables: symp_cough <chr>,
#   symp_rhinitis <chr>, symp_sneezing <chr>, …

The 13 columns between symp_fever and symp_stomach_ache indicate whether or not
each respondent had a specific COVID-compatible symptom.


Now, imagine you wanted to convert all these columns to upper case. (That is, “Yes” to
“YES” and “No” to “NO”). How might you do this? Without across(), you would have to
mutate the columns one by one, with the toupper() function:


yao_symptoms %>%
  mutate(symp_fever = toupper(symp_fever),
         symp_headache = toupper(symp_headache),
         symp_cough = toupper(symp_cough),
         symp_rhinitis = toupper(symp_rhinitis),
         symp_sneezing = toupper(symp_sneezing),
         symp_fatigue = toupper(symp_fatigue),
         symp_muscle_pain = toupper(symp_muscle_pain)
         # ... And on and on and on and on and on
  )


# A tibble: 5 x 16
  age_years sex    date_surveyed symp_fever symp_headache
      <dbl> <chr>  <date>        <chr>      <chr>
1        45 Female 2020-10-22    NO         NO
2        55 Male   2020-10-24    NO         NO
3        23 Male   2020-10-24    NO         NO
4        20 Female 2020-10-22    NO         NO
5        55 Female 2020-10-22    NO         NO
# … with 11 more variables: symp_cough <chr>,
#   symp_rhinitis <chr>, symp_sneezing <chr>, …

This is obviously not very time-efficient. An experienced data analyst who saw this code 
might scold you for not obeying the DRY (“Don’t Repeat Yourself”) principle of 
programming. 


But with the across() function, you have the power to do this in all of two lines:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = toupper))


# A tibble: 5 x 16
  age_years sex    date_surveyed symp_fever symp_headache
      <dbl> <chr>  <date>        <chr>      <chr>
1        45 Female 2020-10-22    NO         NO
2        55 Male   2020-10-24    NO         NO
3        23 Male   2020-10-24    NO         NO
4        20 Female 2020-10-22    NO         NO
5        55 Female 2020-10-22    NO         NO
# … with 11 more variables: symp_cough <chr>,
#   symp_rhinitis <chr>, symp_sneezing <chr>, …


Amazing! 


Let’s break down the code above. We used across() inside of the mutate() function,
and provided it with two main arguments:


• .cols defined the columns to be modified. The symp_fever:symp_stomach_ache
code means “all columns between symp_fever and symp_stomach_ache”.


• .fns defined the functions to apply on the selected columns. In this case, the
toupper function was applied.


And that’s the basic gist of across()! But below we'll consider each of these arguments
in a bit more detail.
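
The two arguments can be tried out on a data frame small enough to inspect in full. This is a minimal sketch; `toy` and its columns are hypothetical stand-ins for the survey data.

```r
library(dplyr)

# Hypothetical toy data with two symptom columns
toy <- tibble(
  id         = 1:2,
  symp_fever = c("Yes", "No"),
  symp_cough = c("No", "No")
)

# .cols picks the range of columns, .fns is applied to each of them
upper <- toy %>%
  mutate(across(.cols = symp_fever:symp_cough,
                .fns = toupper))

upper$symp_fever  # "YES" "NO"
upper$symp_cough  # "NO" "NO"
upper$id          # 1 2 (outside the selected range, so left untouched)
```

Columns outside the .cols selection pass through mutate() unchanged.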


SIDE NOTE


Why follow the DRY (Don’t Repeat Yourself) principle?


There are many reasons to avoid repetitive code. Here are just a few:


1. You'll save time in writing the code (obviously).


2. You'll also save time in maintaining the code. This is because if you
need to make a change (e.g. switch toupper to tolower), you
won't need to make the same change in several places. You can fix
it in a single place.


3. DRY code is usually easier to read and understand, both by yourself
and by others.


The .cols argument


Now let's look at the arguments of across() in some more detail.


As mentioned above, the .cols argument of across() selects the columns to be
modified.


Most of the different methods you have learned for selecting columns can be used here.


One difference with the classical use of select() is that to list column names with
across(), you must wrap them in c():


yao_symptoms %>%
  mutate(across(.cols = c(symp_fever, symp_headache, symp_cough),
                .fns = toupper))


If, instead of c(symp_fever, symp_headache, symp_cough) you just put in
symp_fever, symp_headache, symp_cough, you'll get an error:


yao_symptoms %>%
  mutate(across(.cols = symp_fever, symp_headache, symp_cough,
                .fns = toupper))


Error in `mutate()`
! Problem while computing `..1 = across(.cols = symp_fever, symp_headache....


Other than that, the usual variable selection methods can be used here.


So you can use numeric ranges, like 4:16:


yao_symptoms %>%
  mutate(across(.cols = 4:16,
                .fns = toupper))


# A tibble: 5 x 16
  age_years sex    date_surveyed symp_fever symp_headache
      <dbl> <chr>  <date>        <chr>      <chr>
1        45 Female 2020-10-22    NO         NO
2        55 Male   2020-10-24    NO         NO
3        23 Male   2020-10-24    NO         NO
4        20 Female 2020-10-22    NO         NO
5        55 Female 2020-10-22    NO         NO
# … with 11 more variables: symp_cough <chr>,
#   symp_rhinitis <chr>, symp_sneezing <chr>, …


Or helper verbs like starts_with():


yao_symptoms %>%
  mutate(across(.cols = starts_with("symp_"),
                .fns = toupper))


# A tibble: 5 x 16
  age_years sex    date_surveyed symp_fever symp_headache
      <dbl> <chr>  <date>        <chr>      <chr>
1        45 Female 2020-10-22    NO         NO
2        55 Male   2020-10-24    NO         NO
3        23 Male   2020-10-24    NO         NO
4        20 Female 2020-10-22    NO         NO
5        55 Female 2020-10-22    NO         NO
# … with 11 more variables: symp_cough <chr>,
#   symp_rhinitis <chr>, symp_sneezing <chr>, …


Or the function where() to select columns of a particular type:


yao_symptoms %>%
  mutate(across(.cols = where(is.character),
                .fns = toupper))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>        <chr>      <chr>
## 1        45 FEMALE 2020-10-22    NO         NO
## 2        55 MALE   2020-10-24    NO         NO
## 3        23 MALE   2020-10-24    NO         NO
## 4        20 FEMALE 2020-10-22    NO         NO
## 5        55 FEMALE 2020-10-22    NO         NO
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


Or the catch-all everything():


yao_symptoms %>%
  mutate(across(.cols = everything(),
                .fns = toupper))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##   <chr>     <chr>  <chr>         <chr>      <chr>
## 1 45        FEMALE 2020-10-22    NO         NO
## 2 55        MALE   2020-10-24    NO         NO
## 3 23        MALE   2020-10-24    NO         NO
## 4 20        FEMALE 2020-10-22    NO         NO
## 5 55        FEMALE 2020-10-22    NO         NO
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


Note that everything() is the default value for .cols. So the above code is
equivalent to simply running:


yao_symptoms %>%
  mutate(across(.fns = toupper))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##   <chr>     <chr>  <chr>         <chr>      <chr>
## 1 45        FEMALE 2020-10-22    NO         NO
## 2 55        MALE   2020-10-24    NO         NO
## 3 23        MALE   2020-10-24    NO         NO
## 4 20        FEMALE 2020-10-22    NO         NO
## 5 55        FEMALE 2020-10-22    NO         NO
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...
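

The selection methods above can be compared side by side on a small made-up tibble. This is only a sketch: the object toy and its values are assumptions, not the yao_symptoms data. The first four selections pick exactly the same two columns, while where(is.character) also picks up sex, and everything() additionally turns the numeric column into text.

```r
library(dplyr)

# Hypothetical toy data with one numeric and three character columns
toy <- tibble(
  age_years  = c(45, 55),
  sex        = c("Female", "Male"),
  symp_fever = c("No", "Yes"),
  symp_cough = c("Yes", "No")
)

# Four ways of selecting the two symptom columns
by_name   <- toy %>% mutate(across(.cols = c(symp_fever, symp_cough), .fns = toupper))
by_range  <- toy %>% mutate(across(.cols = symp_fever:symp_cough, .fns = toupper))
by_pos    <- toy %>% mutate(across(.cols = 3:4, .fns = toupper))
by_prefix <- toy %>% mutate(across(.cols = starts_with("symp_"), .fns = toupper))

# where(is.character) also selects `sex`
by_type <- toy %>% mutate(across(.cols = where(is.character), .fns = toupper))

# everything() selects all columns, so age_years is converted to character
all_cols <- toy %>% mutate(across(.cols = everything(), .fns = toupper))
class(all_cols$age_years)
```

Newer dplyr releases (1.1.0 onwards, as far as we are aware) may warn when .cols is left out, so writing everything() explicitly is the safer habit.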
PRACTICE

(in RMD)


In the febrile_diseases dataset, the columns from abd_pain to
splenomegaly indicate whether a patient had a specified symptom,
recorded as “yes” or “no”.


febrile_diseases %>%
  select(abd_pain:splenomegaly)


## # A tibble: 5 x 18
##   abd_pain diarrhoea runny_nose earpain throat_ache cough
##   <chr>    <chr>     <chr>      <chr>   <chr>       <chr>
## 1 yes      yes       no         no      no          no
## 2 yes      yes       no         no      no          yes
## 3 yes      yes       no         no      no          yes
## 4 yes      no        yes        no      no          yes
## 5 yes      no        no         no      no          yes
## # ... with 12 more variables: productive_cough <chr>,
## #   dyspnoa <chr>, dysuria <chr>, myalgia <chr>, ...


Use mutate() and across() to convert the variable levels to uppercase.
(That is, “yes” to “YES” and “no” to “NO”.)


Q_febrile_disease_symptoms <-
  febrile_diseases %>%


The .fns argument 


Now, on to the second argument in across(). As mentioned above, this argument takes
in the function to be applied across columns.


You can provide any valid function here. We had previously used toupper():


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = toupper))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>        <chr>      <chr>
## 1        45 Female 2020-10-22    NO         NO
## 2        55 Male   2020-10-24    NO         NO
## 3        23 Male   2020-10-24    NO         NO
## 4        20 Female 2020-10-22    NO         NO
## 5        55 Female 2020-10-22    NO         NO
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


In a similar style, we can also use tolower():


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = tolower))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>        <chr>      <chr>
## 1        45 Female 2020-10-22    no         no
## 2        55 Male   2020-10-24    no         no
## 3        23 Male   2020-10-24    no         no
## 4        20 Female 2020-10-22    no         no
## 5        55 Female 2020-10-22    no         no
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


Of course, the function we apply through across() needs to be type-appropriate: it
should apply to the type (character, numeric, factor, etc.) of the variables we are feeding
in.


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = log))


Error in `mutate()`:
! non-numeric argument to mathematical function


Here we get an error message because we tried to apply a function made for numeric 
variables to character type variables. 
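

One way to avoid this kind of type mismatch is to let the selection itself do the filtering. The sketch below uses a made-up tibble (toy and its columns are assumptions for illustration) and where(is.numeric) so that log() only ever sees numeric columns.

```r
library(dplyr)

# Hypothetical toy data mixing numeric and character columns
toy <- tibble(
  age_years = c(45, 55),
  weight_kg = c(60, 80),
  sex       = c("Female", "Male")
)

# Only numeric columns are passed to log(); `sex` is left untouched
logged <- toy %>%
  mutate(across(.cols = where(is.numeric),
                .fns = log))

logged
```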


SIDE NOTE


It is a bit confusing to write a function without parentheses, as in .fns =
toupper. There is a difference between toupper() and toupper.


toupper() calls the function, while toupper without parentheses makes
a reference to the function. With a reference to the function, across()
will take care of calling it from its back-end code (the code that defines
across(). We call it “back-end” because it’s “in the back” and you cannot
see it unless you go looking into it explicitly.)
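

The difference between a reference and a call can be tried out in plain R. In this sketch, shout and apply_to_each are hypothetical names invented for illustration; apply_to_each only mimics, in a very simplified way, how a function like across() receives a reference and calls it later.

```r
# A reference: no parentheses, so nothing is run yet
shout <- toupper
is.function(shout)

# Calling the function through the reference
shout("yes")

# A hypothetical helper that receives a function reference in `fn`
# and takes care of calling it on the supplied values
apply_to_each <- function(values, fn) {
  fn(values)
}

apply_to_each(c("yes", "no"), toupper)
```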




PRACTICE

(in RMD)


In the febrile_diseases dataset, ensure that all the columns from
abd_pain to splenomegaly, indicating symptoms of patients, are in
lower case. Apply tolower() across all these variables using mutate()
and across().


Q_febrile_disease_symptoms_to_lower <-
  febrile_diseases %>%


Custom (“anonymous”) functions 


Sometimes it is useful to use a custom function, called a “lambda function” or 
“anonymous function”. You will see more about functions in later lessons. The idea here is 
that you write your own operation which will be applied across your selected variables. 
The writing of these lambda functions has certain strict rules so pay attention to this as 
we go through several examples. 


The toupper example we saw above can be rewritten with this syntax:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ toupper(.x)))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>        <chr>      <chr>
## 1        45 Female 2020-10-22    NO         NO
## 2        55 Male   2020-10-24    NO         NO
## 3        23 Male   2020-10-24    NO         NO
## 4        20 Female 2020-10-22    NO         NO
## 5        55 Female 2020-10-22    NO         NO
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...

In the code .fns = ~ toupper(.x), the tilde, ~, introduces the lambda function, and
.x stands for each of the columns across which you are applying the function. The .x
takes the columns one by one, and the function is “called” on each one.


So overall, this code can be read as “apply toupper() to each of the symptom variables.”
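

To see that the tilde syntax is just another way of writing a function, the sketch below applies the same operation in three ways on a made-up tibble (toy and its values are assumptions for illustration): with the ~ shorthand, with an ordinary function(x) definition, and with the bare function name.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  symp_fever = c("No", "Yes"),
  symp_cough = c("Yes", "No")
)

# Tilde (lambda) shorthand: .x stands for the current column
with_tilde <- toy %>%
  mutate(across(.cols = everything(), .fns = ~ toupper(.x)))

# The same thing written as an ordinary anonymous function
with_function <- toy %>%
  mutate(across(.cols = everything(), .fns = function(x) toupper(x)))

# And with a plain reference to the existing function
with_name <- toy %>%
  mutate(across(.cols = everything(), .fns = toupper))

all.equal(with_tilde, with_function)
all.equal(with_tilde, with_name)
```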


Here is another example, but with tolower():


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ tolower(.x)))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>        <chr>      <chr>
## 1        45 Female 2020-10-22    no         no
## 2        55 Male   2020-10-24    no         no
## 3        23 Male   2020-10-24    no         no
## 4        20 Female 2020-10-22    no         no
## 5        55 Female 2020-10-22    no         no
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


The pattern is quite simple once you get used to it. 


Now, with this anonymous function syntax, it becomes very intuitive to use functions that 
take in multiple arguments. 


For example, we could make the “No” and “Yes” values more explicit by pasting onto each
string what it refers to, in other words, symptoms:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ paste0(.x, " symptoms")))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever  symp_headache
##       <dbl> <chr>  <date>        <chr>       <chr>
## 1        45 Female 2020-10-22    No symptoms No symptoms
## 2        55 Male   2020-10-24    No symptoms No symptoms
## 3        23 Male   2020-10-24    No symptoms No symptoms
## 4        20 Female 2020-10-22    No symptoms No symptoms
## 5        55 Female 2020-10-22    No symptoms No symptoms
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


Or we could use str_sub(), a function that lets you keep a subset of a string:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ str_sub(.x, start = 1, end = 1)))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>        <chr>      <chr>
## 1        45 Female 2020-10-22    N          N
## 2        55 Male   2020-10-24    N          N
## 3        23 Male   2020-10-24    N          N
## 4        20 Female 2020-10-22    N          N
## 5        55 Female 2020-10-22    N          N
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


In our case the string values are “No” and “Yes”, so we make a substring with just their
first letters (“N” and “Y”) to get a one-letter encoding.


Or we can recode the “Yes” and “No” entries in a different manner:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ if_else(.x == "Yes", "Has symptom", "Does not have
symptom")))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever
##       <dbl> <chr>  <date>        <chr>
## 1        45 Female 2020-10-22    Does not have symptom
## 2        55 Male   2020-10-24    Does not have symptom
## 3        23 Male   2020-10-24    Does not have symptom
## 4        20 Female 2020-10-22    Does not have symptom
## 5        55 Female 2020-10-22    Does not have symptom
## # ... with 12 more variables: symp_headache <chr>,
## #   symp_cough <chr>, symp_rhinitis <chr>, ...


Now we have the “Yes” encoded as “Has symptom” and the “No” encoded as “Does not
have symptom”. These strings are longer, but their meaning is clearer than just
“Yes” vs. “No”.


We could also recode the “Yes” and “No” to numeric values:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ if_else(.x == "Yes", 1, 2)))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>             <dbl>         <dbl>
## 1        45 Female 2020-10-22             2             2
## 2        55 Male   2020-10-24             2             2
## 3        23 Male   2020-10-24             2             2
## 4        20 Female 2020-10-22             2             2
## 5        55 Female 2020-10-22             2             2
## # ... with 11 more variables: symp_cough <dbl>,
## #   symp_rhinitis <dbl>, symp_sneezing <dbl>, ...


Now we have “Yes” encoded as 1 and “No” as 2.
Note that you can chain several mutate() calls together:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ if_else(.x == "Yes", 1, 2))) %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ case_when(.x == 1 ~ 1,
                                   .x == 2 ~ 0)))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>             <dbl>         <dbl>
## 1        45 Female 2020-10-22             0             0
## 2        55 Male   2020-10-24             0             0
## 3        23 Male   2020-10-24             0             0
## 4        20 Female 2020-10-22             0             0
## 5        55 Female 2020-10-22             0             0
## # ... with 11 more variables: symp_cough <dbl>,
## #   symp_rhinitis <dbl>, symp_sneezing <dbl>, ...


Above, we first convert “Yes” and “No” to the numeric values 1 and 2, to follow R
indexing (R counts from 1 onwards). However, other programming languages, such as
Python, start their index at 0. For many machine learning algorithms, your encoding
should be 0 or 1, so this can be a useful conversion of the encoding of your data.
We use case_when() to define which numeric value should be switched to 1 (TRUE, has
symptoms) and which numeric value should be switched to 0
(FALSE, does not have symptoms).
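

If the 0/1 encoding is the goal, the two steps can also be collapsed into one. The sketch below is one possible way to do this, on a made-up tibble (toy and its values are assumptions): case_when() maps “Yes” to 1 and “No” to 0 directly, and any other value, including a missing one, is left as NA.

```r
library(dplyr)

# Hypothetical toy data, including one missing answer
toy <- tibble(
  symp_fever = c("Yes", "No", NA),
  symp_cough = c("No", "No", "Yes")
)

# One-step recoding to 0/1; unmatched values (here NA) stay NA
recoded <- toy %>%
  mutate(across(.cols = everything(),
                .fns = ~ case_when(.x == "Yes" ~ 1,
                                   .x == "No"  ~ 0)))

recoded
```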


PRACTICE

(in RMD)


In the febrile_diseases dataset, the columns from abd_pain to
splenomegaly indicate whether a patient
had a specified symptom, recorded as “yes” or “no”.


Use mutate(), across() and an anonymous function to convert the
variable levels to numbers, with “yes” as 1 and “no” as 0.


Q_febrile_disease_symptoms_to_numeric <-
  febrile_diseases %>%


PRACTICE

(in RMD)


In the diet dataset, the columns from retinol to zinc give the number
of milligrams of each nutrient consumed by the surveyed women in a
day.


diet %>%
  select(retinol:zinc)


## # A tibble: 5 x 15
##   retinol alpha_catorene beta_catorene beta_cryptoxanthin
##     <dbl>          <dbl>         <dbl>              <dbl>
## 1   0.998         0.0273          1.58             0.141
## 2   3.53          0.114           3.66             0.0804
## 3   1.32          0.388          13.9              0.0117
## 4   1.08          0.0305         10.9              0.509
## 5   1.25          0.102           9.66             0.164
## # ... with 11 more variables: vitamin_c <dbl>,
## #   vitamin_b2 <dbl>, vitamin_b3 <dbl>, vitamin_b6 <dbl>, ...


Use mutate(), across() and an anonymous function to convert these
values to grams (divide by 1000).


Q_diet_to_grams <-
  diet %>%


Creating new columns with the .names argument 
The examples we have seen so far all involved replacing existing columns. 
But what if you want to create new columns instead? 


To illustrate this, let's create a smaller subset of yao_symptoms:


yao_symptoms_mini <-
  yao_symptoms %>%
  select(symp_fever, symp_headache, symp_cough)


yao_symptoms_mini


## # A tibble: 5 x 3
##   symp_fever symp_headache symp_cough
##   <chr>      <chr>         <chr>
## 1 No         No            No
## 2 No         No            No
## 3 No         No            No
## 4 No         No            No
## 5 No         No            No

Now, to convert all columns to uppercase, we would usually run:


yao_symptoms_mini %>%
  mutate(across(.fns = toupper))


## # A tibble: 5 x 3
##   symp_fever symp_headache symp_cough
##   <chr>      <chr>         <chr>
## 1 NO         NO            NO
## 2 NO         NO            NO
## 3 NO         NO            NO
## 4 NO         NO            NO
## 5 NO         NO            NO

This code modifies existing columns in place.


(Remember that the default argument for .cols is everything(), so the above code
modifies all columns in the dataset.)


Now, if we instead want to make new columns that are uppercase, we can use the .names
argument of across():


yao_symptoms_mini %>%
  mutate(across(.fns = toupper,
                .names = "{.col}_uppercase"))


## # A tibble: 5 x 6
##   symp_fever symp_headache symp_cough symp_fever_uppercase
##   <chr>      <chr>         <chr>      <chr>
## 1 No         No            No         NO
## 2 No         No            No         NO
## 3 No         No            No         NO
## 4 No         No            No         NO
## 5 No         No            No         NO
## # ... with 2 more variables: symp_headache_uppercase <chr>,
## #   symp_cough_uppercase <chr>


{.col} represents each of the old column names. The rest of the string, “_uppercase”, is
pasted together with the old column name. So the code "{.col}_uppercase" can
be read as “for each column, convert to uppercase and name the result by pasting the
existing column name together with _uppercase.”
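

The naming pattern can be checked on a small made-up tibble. In this sketch, toy and its values are assumptions for illustration. The original columns are kept as they were, and the new ones are added after them with the suffix pasted on.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  symp_fever = c("No", "Yes"),
  symp_cough = c("Yes", "No")
)

# Create new uppercase columns instead of overwriting the old ones
with_suffix <- toy %>%
  mutate(across(.cols = everything(),
                .fns = toupper,
                .names = "{.col}_uppercase"))

# Inspect the resulting column names
names(with_suffix)
```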


Of course, we can input any arbitrary string:


yao_symptoms_mini %>%
  mutate(across(.fns = toupper,
                .names = "{.col}_BIG_LETTERS"))


## # A tibble: 5 x 6
##   symp_fever symp_headache symp_cough symp_fever_BIG_LETTERS
##   <chr>      <chr>         <chr>      <chr>
## 1 No         No            No         NO
## 2 No         No            No         NO
## 3 No         No            No         NO
## 4 No         No            No         NO
## 5 No         No            No         NO
## # ... with 2 more variables: symp_headache_BIG_LETTERS <chr>,
## #   symp_cough_BIG_LETTERS <chr>


If we want the text to come before the old column name, we can also do this:


yao_symptoms_mini %>%
  mutate(across(.fns = toupper,
                .names = "uppercase_{.col}"))


## # A tibble: 5 x 6
##   symp_fever symp_headache symp_cough uppercase_symp_fever
##   <chr>      <chr>         <chr>      <chr>
## 1 No         No            No         NO
## 2 No         No            No         NO
## 3 No         No            No         NO
## 4 No         No            No         NO
## 5 No         No            No         NO
## # ... with 2 more variables: uppercase_symp_headache <chr>,
## #   uppercase_symp_cough <chr>


More usefully, we can create a numeric version of these symptoms variables:


yao_symptoms_mini %>%
  mutate(across(.fns = ~ if_else(.x == "Yes", 1, 0),
                .names = "numeric_{.col}"))


## # A tibble: 5 x 6
##   symp_fever symp_headache symp_cough numeric_symp_fever
##   <chr>      <chr>         <chr>                   <dbl>
## 1 No         No            No                          0
## 2 No         No            No                          0
## 3 No         No            No                          0
## 4 No         No            No                          0
## 5 No         No            No                          0
## # ... with 2 more variables: numeric_symp_headache <dbl>,
## #   numeric_symp_cough <dbl>


PRACTICE

(in RMD)


Now you will again convert the columns from abd_pain to
splenomegaly in the febrile_diseases dataset, on patient symptoms,
into numerical values. But this time, you will create new columns named
numeric_abd_pain to numeric_splenomegaly using the .names
argument within across().


Q_febrile_disease_symptoms_to_numeric_new_variables <-
  febrile_diseases %>%


Using across() with summarize()


To get summary statistics over multiple variables it is often helpful to use across().


Consider again the columns from retinol to zinc in the diet dataset, which indicate
the number of milligrams of each nutrient consumed by surveyed Vietnamese women in a
day:


diet %>%
  select(retinol:zinc)


## # A tibble: 5 x 15
##   retinol alpha_catorene beta_catorene beta_cryptoxanthin
##     <dbl>          <dbl>         <dbl>              <dbl>
## 1   0.998         0.0273          1.58             0.141
## 2   3.53          0.114           3.66             0.0804
## 3   1.32          0.388          13.9              0.0117
## 4   1.08          0.0305         10.9              0.509
## 5   1.25          0.102           9.66             0.164
## # ... with 11 more variables: vitamin_c <dbl>,
## #   vitamin_b2 <dbl>, vitamin_b3 <dbl>, vitamin_b6 <dbl>, ...


Imagine you wanted to find the average amount of each nutrient consumed. To do this
the usual way, you would need to type:


diet %>%
  summarize(mean_retinol = mean(retinol),
            mean_alpha_catorene = mean(alpha_catorene),
            mean_beta_catorene = mean(beta_catorene),
            mean_vitamin_c = mean(vitamin_c),
            mean_vitamin_b2 = mean(vitamin_b2)
            # And on and on and on for 15 columns
            )


## # A tibble: 1 x 5
##   mean_retinol mean_alpha_catorene mean_beta_catorene
##          <dbl>               <dbl>              <dbl>
## 1         2.06               0.152               6.15
## # ... with 2 more variables: mean_vitamin_c <dbl>,
## #   mean_vitamin_b2 <dbl>


Of course this is not very efficient.


But with across(), this can be done in just two lines:


diet %>%
  summarize(across(.cols = retinol:zinc,
                   .fns = mean))


## # A tibble: 1 x 15
##   retinol alpha_catorene beta_catorene beta_cryptoxanthin
##     <dbl>          <dbl>         <dbl>              <dbl>
## 1    2.06          0.152          6.15              0.210
## # ... with 11 more variables: vitamin_c <dbl>,
## #   vitamin_b2 <dbl>, vitamin_b3 <dbl>, vitamin_b6 <dbl>, ...


And recall that one of the primary benefits of summarize() is that it facilitates grouped
summaries. Well, we can still use those here!


diet %>%
  group_by(age_group) %>%
  summarize(across(.cols = retinol:zinc,
                   .fns = mean))


## # A tibble: 4 x 16
##   age_group retinol alpha_catorene beta_catorene
##   <chr>       <dbl>          <dbl>         <dbl>
## 1 20-29         Zed          0.130          Deal
## 2 30-39        2.92          0.164          6.18
## # ... with 12 more variables: beta_cryptoxanthin <dbl>,
## #   vitamin_c <dbl>, vitamin_b2 <dbl>, vitamin_b3 <dbl>, ...


Beautiful! So much information extracted so easily. 


Here we grouped the data by age group and then, across all the nutrient variables,
calculated the mean within each age group. The output can be read as: “for the 40-49
years old age group, the mean consumption of retinol is roughly 1.343 micrograms, which
seems lower than for other age groups.”
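

The grouped pattern is easy to verify by hand on a tiny made-up table. In the sketch below, toy_diet and its numbers are assumptions chosen so the group means are obvious. Note that the grouping column itself is not touched by across().

```r
library(dplyr)

# Hypothetical toy data: two age groups, two nutrient columns
toy_diet <- tibble(
  age_group = c("20-29", "20-29", "30-39", "30-39"),
  retinol   = c(1, 3, 2, 4),
  zinc      = c(10, 20, 30, 50)
)

# Mean of every nutrient column within each age group
means_by_group <- toy_diet %>%
  group_by(age_group) %>%
  summarize(across(.cols = retinol:zinc,
                   .fns = mean))

means_by_group
```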


Let’s see another example. 


The columns from is_drug_parac to is_drug_other in the yaounde dataset indicate,
as 1 or 0, whether or not a survey respondent was treated with the named drug:


yao_drugs <-
  yaounde %>%
  select(age_years, sex, date_surveyed, is_drug_parac:is_drug_other)


yao_drugs


## # A tibble: 5 x 12
##   age_years sex    date_surveyed is_drug_parac
##       <dbl> <chr>  <date>                <dbl>
## 1        45 Female 2020-10-22                1
## 2        55 Male   2020-10-24               NA
## 3        23 Male   2020-10-24               NA
## 4        20 Female 2020-10-22                0
## 5        55 Female 2020-10-22               NA
## # ... with 8 more variables: is_drug_antibio <dbl>,
## #   is_drug_hydrocortisone <dbl>, ...


How could we count the number of respondents who took each drug?


We can simply take the sum of each column, selecting the columns intelligently and using
the sum() function:


yao_drugs %>%
  summarize(across(.cols = starts_with("is_drug"),
                   .fns = sum))


## # A tibble: 1 x 9
##   is_drug_parac is_drug_antibio is_drug_hydrocortisone
##           <dbl>           <dbl>                  <dbl>
## 1            NA              NA                     NA
## # ... with 6 more variables: is_drug_other_anti_inflam <dbl>,
## #   is_drug_antiviral <dbl>, is_drug_chloro <dbl>, ...


Oh no! We get all NAs!


We were smart and selected all our columns using starts_with(), but we forgot to
consider that sum() has na.rm set to FALSE by default. We need to ensure the na.rm
argument is set to TRUE.


The best way to do this is with lambda/anonymous function syntax:


yao_drugs %>%
  summarize(across(.cols = starts_with("is_drug"),
                   .fns = ~ sum(.x, na.rm = TRUE)))


## # A tibble: 1 x 9
##   is_drug_parac is_drug_antibio is_drug_hydrocortisone
##           <dbl>           <dbl>                  <dbl>
## 1           162              79                     14
## # ... with 6 more variables: is_drug_other_anti_inflam <dbl>,
## #   is_drug_antiviral <dbl>, is_drug_chloro <dbl>, ...


Again, we could also create a grouped summary:


yao_drugs %>%
  group_by(sex) %>%
  summarize(across(.cols = starts_with("is_drug"),
                   .fns = ~ sum(.x, na.rm = TRUE)))


## # A tibble: 2 x 10
##   sex    is_drug_parac is_drug_antibio is_drug_hydrocortis...
##   <chr>          <dbl>           <dbl>                 <dbl>
## 1 Female            93              42                     7
## 2 Male              69              37                     7
## # ... with 6 more variables: is_drug_other_anti_inflam <dbl>,
## #   is_drug_antiviral <dbl>, is_drug_chloro <dbl>, ...


This last code chunk counts the number of individuals, per sex (group by sex), who have 
received each drug (summing the number of people across each drug variable). 
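

The effect of na.rm can be reproduced on a handful of made-up rows. In this sketch, toy_drugs and its values are assumptions for illustration. Without na.rm = TRUE every column containing a missing value sums to NA; with the lambda, the missing values are skipped.

```r
library(dplyr)

# Hypothetical toy data with missing values in each drug column
toy_drugs <- tibble(
  sex             = c("Female", "Male", "Female"),
  is_drug_parac   = c(1, NA, 0),
  is_drug_antibio = c(NA, 1, 1)
)

# Plain sum(): any NA in a column makes the total NA
without_na_rm <- toy_drugs %>%
  summarize(across(.cols = starts_with("is_drug"),
                   .fns = sum))

# Lambda with na.rm = TRUE: NAs are dropped before summing
with_na_rm <- toy_drugs %>%
  summarize(across(.cols = starts_with("is_drug"),
                   .fns = ~ sum(.x, na.rm = TRUE)))

without_na_rm
with_na_rm
```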


A final example. 


Recall that the 13 columns from symp_fever to symp_stomach_ache in the
yao_symptoms dataset indicate whether or not each respondent had a specific
COVID-compatible symptom:


yao_symptoms


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>        <chr>      <chr>
## 1        45 Female 2020-10-22    No         No
## 2        55 Male   2020-10-24    No         No
## 3        23 Male   2020-10-24    No         No
## 4        20 Female 2020-10-22    No         No
## 5        55 Female 2020-10-22    No         No
## # ... with 11 more variables: symp_cough <chr>,
## #   symp_rhinitis <chr>, symp_sneezing <chr>, ...


How would we count the number of people with each symptom using across()?
We have two options.


Option 1: We could first mutate() the “Yes” and “No” to numeric values:


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ if_else(.x == "Yes", 1, 0)))


## # A tibble: 5 x 16
##   age_years sex    date_surveyed symp_fever symp_headache
##       <dbl> <chr>  <date>             <dbl>         <dbl>
## 1        45 Female 2020-10-22             0             0
## 2        55 Male   2020-10-24             0             0
## 3        23 Male   2020-10-24             0             0
## 4        20 Female 2020-10-22             0             0
## 5        55 Female 2020-10-22             0             0
## # ... with 11 more variables: symp_cough <dbl>,
## #   symp_rhinitis <dbl>, symp_sneezing <dbl>, ...


And then use sum() within summarize():


yao_symptoms %>%
  mutate(across(.cols = symp_fever:symp_stomach_ache,
                .fns = ~ if_else(.x == "Yes", 1, 0))) %>%
  summarize(across(.cols = symp_fever:symp_stomach_ache,
                   .fns = sum))


## # A tibble: 1 x 13
##   symp_fever symp_headache symp_cough symp_rhinitis
##        <dbl>         <dbl>      <dbl>         <dbl>
## 1        143           135        130            89
## # ... with 9 more variables: symp_sneezing <dbl>,
## #   symp_fatigue <dbl>, symp_muscle_pain <dbl>, ...


Option 2: We could jump directly to summarize(), by summing with a condition:


yao_symptoms %>%
  summarize(across(.cols = symp_fever:symp_stomach_ache,
                   .fns = ~ sum(.x == "Yes")))


## # A tibble: 1 x 13
##   symp_fever symp_headache symp_cough symp_rhinitis
##        <int>         <int>      <int>         <int>
## 1        143           135        130            89
## # ... with 9 more variables: symp_sneezing <int>,
## #   symp_fatigue <int>, symp_muscle_pain <int>, ...


This code can be read as: across each symptom column, count all individuals who have been
recorded as having that symptom (who have a “Yes” data entry for that symptom).
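

Both options can be checked against each other on a small made-up tibble. In this sketch, toy and its values are assumptions for illustration. The counts agree; the only difference is the column type, since summing a logical condition gives integers while summing the recoded 1/0 values gives doubles.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  symp_fever = c("Yes", "No", "Yes"),
  symp_cough = c("No", "No", "Yes")
)

# Option 1: recode to 1/0 first, then sum
option_1 <- toy %>%
  mutate(across(.cols = everything(),
                .fns = ~ if_else(.x == "Yes", 1, 0))) %>%
  summarize(across(.cols = everything(),
                   .fns = sum))

# Option 2: sum the logical condition directly
option_2 <- toy %>%
  summarize(across(.cols = everything(),
                   .fns = ~ sum(.x == "Yes")))

option_1
option_2
```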


PRACTICE

(in RMD)


In the diet data set, the variables fao_fgw1 to fao_fgw21 record the
number of calories consumed from different FAO food groups. (FAO
stands for “Food and Agriculture Organization”; the food groups are
shown in Appendix 1.)


Use summarize() and across() to calculate the mean amount of
calories obtained from each food group.


Q_diet_FAO_mean <-
  diet %>%


PRACTICE

(in RMD)


In the febrile_diseases data set, the columns from abd_pain to
splenomegaly contain information
on whether a patient had a specified symptom, recorded as “yes” or “no”.
Use summarize() and across() to count the number of people with each
symptom.


Q_febrile_disease_symptoms_count <-
  febrile_diseases %>%


PRACTICE

(in RMD)


In the yaounde data set, calculate the median for the age, height, weight,
number of bedridden days and number of days off from work (i.e. from
the variable age_years to the variable n_bedridden_days).


Use summarize() and across(), giving the .fns argument a lambda
function to calculate the median. Careful! A lambda function with the
right arguments is indispensable, otherwise you will get an NA median for
some of the variables.


Q_yaounde_median <-
  yaounde %>%


Multiple summary statistics 


When we explored summarize() we rejoiced in the fact that we could calculate
multiple summary statistics at the same time. This is also possible within across().


Coming back to the diet survey data from Vietnam, we could calculate both the mean and
the median across all the nutrient variables:


diet %>%
  summarise(across(.cols = retinol:zinc,
                   .fns = list(mean = mean,
                               median = median)))


## # A tibble: 1 x 30
##   retinol_mean retinol_median alpha_catorene_mean
##          <dbl>          <dbl>               <dbl>
## 1         2.06          0.724               0.152
## # ... with 27 more variables: alpha_catorene_median <dbl>,
## #   beta_catorene_mean <dbl>, beta_catorene_median <dbl>, ...


Here we want to calculate both the mean and the median of every selected numeric
variable in the data set. We can do so by providing the .fns argument of across() with a
list.


SIDE NOTE


Small joke: .fns isn't “functions” abbreviated in the plural for nothing! If we
could only apply one function within across(), it would have been
named .fn (“function”, singular, abbreviated).


This time, across() takes care of naming the resulting summary statistic
columns. The syntax is: list(desired_name_1 = function_1, desired_name_2 =
function_2).
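

The default naming can be seen on a tiny made-up table. In this sketch, toy_diet and its numbers are assumptions for illustration. With a named list and no .names argument, each output column is called column name, underscore, function name.

```r
library(dplyr)

# Hypothetical toy data
toy_diet <- tibble(
  retinol = c(1, 2, 3),
  zinc    = c(10, 20, 60)
)

# A named list of functions; across() builds the output names itself
default_names <- toy_diet %>%
  summarize(across(.cols = retinol:zinc,
                   .fns = list(mean = mean,
                               median = median)))

# Inspect the automatically generated column names
names(default_names)
```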


Let's see how you could control the naming even more, when operating on a list of
functions:


diet %>%
  summarise(across(.cols = retinol:zinc,
                   .fns = list(average = mean, median = median),
                   .names = "{.fn}_{.col}"))


## # A tibble: 1 x 30
##   average_retinol median_retinol average_alpha_catorene
##             <dbl>          <dbl>                  <dbl>
## 1            2.06          0.724                  0.152
## # ... with 27 more variables: median_alpha_catorene <dbl>,
## #   average_beta_catorene <dbl>, ...


Here we reference the name of the function using {.fn} and the name of the column
with {.col}. It is important to note that both abbreviations are singular! They are
singular because they reference the function and the column one by one. Internally,
across() takes the functions and the columns one by one and, for
each pair, takes the function name, such as average, and the column name, such as
retinol, and names the summary statistic average_retinol (i.e. {.fn}=average and
{.col}=retinol).


As we are discussing mean, median and standard deviation calculations, we have to plan
for NA values. Consider the code below:


diet %>%
  summarise(across(.cols = retinol:zinc,
                   .fns = list(average = ~ mean(.x, na.rm = TRUE),
                               median = ~ median(.x, na.rm = TRUE)),
                   .names = "{.fn}_{.col}"))


## # A tibble: 1 x 30
##   average_retinol median_retinol average_alpha_catorene
##             <dbl>          <dbl>                  <dbl>
## 1            2.06          0.724                  0.152
## # ... with 27 more variables: median_alpha_catorene <dbl>,
## #   average_beta_catorene <dbl>, ...


Here we have the same code as above, except that we stop missing values from turning
the means or medians into NA by adding the na.rm = TRUE argument to the functions. For
this, as we have seen above, we need to use the lambda/anonymous function style. Here we
are giving the .fns argument a list of lambda functions.
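

The same pattern can be run on a tiny made-up table containing missing values. In this sketch, toy_diet and its numbers are assumptions for illustration. Each lambda drops the NAs before computing its statistic, and the .names pattern puts the function name first.

```r
library(dplyr)

# Hypothetical toy data with one missing value per column
toy_diet <- tibble(
  retinol = c(1, 2, NA),
  zinc    = c(10, NA, 30)
)

# A named list of lambdas, each ignoring missing values
stats <- toy_diet %>%
  summarize(across(.cols = retinol:zinc,
                   .fns = list(average = ~ mean(.x, na.rm = TRUE),
                               median  = ~ median(.x, na.rm = TRUE)),
                   .names = "{.fn}_{.col}"))

names(stats)
stats
```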


PRACTICE

(in RMD)


In the diet data set, calculate the mean and the standard deviation for
kilocalories, water, carbohydrates, fat, and protein consumed (i.e. from
the variable kilocalories_consumed to the variable
carbs_consumed_grams).


Use summarize() and across(), giving the .fns argument a list of the
desired summary statistics. Make sure your means are named
COLUMN_NAME_mean and your standard deviations are named
COLUMN_NAME_sd.


Q_diet_food_composition_mean_sd <-
  diet %>%


PRACTICE

(in RMD)


In the febrile_diseases data set, calculate the mean and the standard
deviation for white blood cells, and all other blood analysis measurements
(i.e. from the variable wbc to the variable relymp_a; see Appendix 2
for the full names behind the variable abbreviations).


Use summarize() and across(), giving the .fns argument a list of the
desired summary statistics. Careful! You need to give a list of lambda
functions to calculate the mean and standard deviation, paying attention
to the na.rm arguments, otherwise your summary statistics will be set to NA.


Make sure your means are named COLUMN_NAME_mean and your standard
deviations are named COLUMN_NAME_sd.


Q_febrile_diseases_mean_blood_composition <-
  febrile_diseases %>%


Recap!


across() can be used inside many different {dplyr} verbs:

• mutate(across(multiple columns, function(s) to apply))
• summarize(across(multiple columns, function(s) to apply))


The statement defining multiple columns can be:

• a list of names, e.g. c(symp_fever, symp_headache, symp_cough)
• a range of names, e.g. retinol:zinc


The function(s) to apply across all columns can be:

• an existing R function (such as as.factor, mean, etc.)
• a custom (lambda/anonymous) function
• a list of existing functions (such as list(mean = mean, sd = sd))
• a list of custom (lambda/anonymous) functions


Wrap up! 


This was your first look at across(): congrats on making it through! Remember
the power of combining across() with other verbs. If you find that a summarizing or
mutating operation is identical for more than one variable, then you should usually think
of using across().


In the upcoming lessons we will see some more data wrangling verbs: see you soon!


Contributors 


The following team members contributed to this lesson: 


LAURE VANCAUWENBERGHE

Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health
and education


KENE DAVID NWOSU

Data analyst, the GRAPH Network
Passionate about world improvement


References 


Some material in this lesson was adapted from the following sources: 


• Summarise each group to fewer rows. (n.d.). Retrieved 21 February 2022, from
https://dplyr.tidyverse.org/reference/summarize.html


• Create, modify, and delete columns — Mutate. (n.d.). Retrieved 21 February 2022,
from https://dplyr.tidyverse.org/reference/mutate.html


• Apply a function (or functions) across multiple columns — Across. (n.d.). Retrieved
21 February 2022, from https://dplyr.tidyverse.org/reference/across.html


Artwork was adapted from:


• Horst, A. (2022). R & stats illustrations by Allison Horst.
https://github.com/allisonhorst/stats-illustrations (Original work published 2018)


Appendix 1: FAO Food Groups 


code         meaning

fao_fgw1     Consumed amount from Foods made from grains
fao_fgw2     Consumed amount from White roots and tubers and plantain
fao_fgw3     Consumed amount from Pulses
fao_fgw4     Consumed amount from Nuts and seeds
fao_fgw5     Consumed amount from Milk and milk products
fao_fgw6     Consumed amount from Organ meat
fao_fgw7     Consumed amount from Meat and poultry
fao_fgw8     Consumed amount from Fish and seafood
fao_fgw9     Consumed amount from Eggs
fao_fgw10    Consumed amount from Dark green leafy vegetables
fao_fgw11    Consumed amount from Vitamin A-rich vegetables, roots and tubers
fao_fgw12    Consumed amount from Vitamin A-rich fruits
fao_fgw13    Consumed amount from Other vegetables
fao_fgw14    Consumed amount from Other fruits
fao_fgw15    Consumed amount from Insects and other small protein foods
fao_fgw16    Consumed amount from Other oils and fats
fao_fgw17    Consumed amount from Savoury and fried snacks
fao_fgw18    Consumed amount from Sweets
fao_fgw19    Consumed amount from Sugar sweetened beverages
fao_fgw20    Consumed amount from Condiments and seasonings
fao_fgw21    Consumed amount from Other beverages and foods


Appendix 2: Blood sample composition 


Abbreviation  Complete Name

WBC           white blood cell
RBC           red blood cell
HGB           hemoglobin
PLT           platelet
NEUT_A        neutrophils
LYMP_A        lymphocytes
MONO_A        monocytes
EOSI_A        eosinophils
BASO_A        basophils
NRBC_A        nucleated red blood cells
IG_A          immature granulocytes
RET_A         reticulocytes
ASLYMP_A      antibody-synthesizing lymphocytes
RELYMP_A      reactive lymphocytes




Lesson notes | Pivoting data 


Created by the GRAPH Courses team 


January 2023 


This document serves as an accompaniment for a lesson found on
https://thegraphcourses.org.


The GRAPH Courses is a project of the Global Research and Analyses for Public Health 
(GRAPH) Network, a non-profit headquartered at the University of Geneva Global Health 
Institute, and supported by the World Health Organization (WHO) and other partners 


Intro
Learning Objectives
Packages
What do wide and long mean?
When should you use wide vs long data?
Pivoting wide to long
Pivoting long to wide
Why is long data better for analysis?
    Filtering grouped data
    Summarizing grouped data
    Plotting
Pivoting can be hard
Wrap Up!


Intro 


Pivoting or reshaping is a data manipulation technique that involves re-orienting the rows 
and columns of a dataset. This is sometimes required to make data easier to analyze, or 
to make data easier to understand. 


In this lesson, we will cover how to effectively pivot data using pivot_longer() and
pivot_wider() from the tidyr package.


Learning Objectives 


• You will understand what wide data format is, and what long data format is.


• You will know how to pivot wide data to long data using pivot_longer().


• You will know how to pivot long data to wide data using pivot_wider().


• You will understand why the long data format is easier for plotting and wrangling in
R.


Packages 


# Load packages 
if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, outbreaks, janitor, rio, here, knitr)


What do wide and long mean? 


The terms wide and long are best understood in the context of example datasets. Let’s 
take a look at some now. 


Imagine that you have three patients from whom you collect blood pressure data on 
three days. 


You can record the data in a wide format like this: 


patient   blood_pressure_day_1   blood_pressure_day_2   blood_pressure_day_3
A         110                    112                    114
B         120                    122                    124
C         100                    104                    105


Fig: wide dataset for a timeseries of patients. 


Or you could record the data in a long format, like so:


patient   day   blood_pressure
A         1     110
A         2     112
A         3     114
B         1     120
B         2     122
B         3     124
C         1     100
C         2     104
C         3     105


Fig: long dataset for a timeseries of patients. 


Take a minute to study the two datasets to make sure you understand the relationship 
between them. 


In the wide dataset, each observational unit (each patient) occupies only one row, and
each measurement (blood pressure day 1, blood pressure day 2...) is in a separate column.


In the long dataset, on the other hand, each observational unit (each patient) occupies 
multiple rows, with one row for each measurement. 
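To make the relationship concrete, the two blood pressure tables can be rebuilt in R. This is only a sketch: the object names bp_wide and bp_long are made up for illustration, and the conversion uses pivot_longer() from {tidyr}, which is introduced later in this lesson.

```r
library(tidyr)

# Wide format: one row per patient, one column per day
bp_wide <- tibble::tribble(
  ~patient, ~blood_pressure_day_1, ~blood_pressure_day_2, ~blood_pressure_day_3,
  "A", 110, 112, 114,
  "B", 120, 122, 124,
  "C", 100, 104, 105
)

# Long format: one row per patient-day combination
bp_long <- pivot_longer(
  bp_wide,
  cols = -patient,                       # pivot every column except patient
  names_to = "day",
  names_prefix = "blood_pressure_day_",  # keep only the day number
  values_to = "blood_pressure"
)

dim(bp_wide)  # rows and columns of the wide table
dim(bp_long)  # rows and columns of the long table
```

The wide table has one row per patient, while the long table has one row for each of the nine patient-day measurements.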


Here is another example with mock data, in which the observational units are countries: 


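[Table not recoverable from the source. Recovered column headers: country, metric]


Fig: long dataset where the unique observation unit is a country.


[Table not recoverable from the source. Recovered column headers: country, yr1960, yr1970, yr2010]


Fig: the equivalent wide dataset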


The examples above are both time-series datasets, because the measurements are 
repeated across time (day 1, day 2 and so on). But the concepts of long and wide are 
relevant to other kinds of data too, not just time series data. 


Consider the example below, showing the number of patients in different units of three
hospitals:


Hospital     Maternity unit   Intensive care unit
Hospital A   4                2
Hospital B   5                2
Hospital C   6                3


Fig: wide dataset, where each hospital is an observational unit 


Hospital Unit Num. of patients 


Hospital A Maternity 4 
Hospital A Intensive care 2 
Hospital B Maternity 5 
Hospital B Intensive care 2 
Hospital C Maternity 6 
Hospital C Intensive care 3 


Fig: the equivalent long dataset 


In the wide dataset, again, each observational unit (each hospital) occupies only one row,
with the repeated measurements for that unit (number of patients in different units)
spread across two columns.


In the long dataset, each observational unit is spread over multiple rows.


VOCAB


The “observational units”, sometimes called “statistical units”, of a dataset
are the primary entities or items described by the columns in that
dataset.


In the first example, the observational/statistical units were patients; in
the second example, countries; and in the third example, hospitals.


PRACTICE (in RMD)


Consider the mock dataset created below:


temperatures <-
  data.frame(
    country = c("Sweden", "Denmark", "Norway"),
    avgtemp.1994 = 1:3,
    avgtemp.1995 = 3:5,
    avgtemp.1996 = 5:7)


temperatures


##   country avgtemp.1994 avgtemp.1995 avgtemp.1996
## 1  Sweden            1            3            5
## 2 Denmark            2            4            6
## 3  Norway            3            5            7


Is this data in a wide or long format?


# Enter the string "wide" or the string "long"
# Assign your answer to the object Q_data_type
Q_data_type <- "____"
# Then run the provided CHECK function


When should you use wide vs long data? 


The truth is: it really depends on what you want to do! The wide format is great for 
displaying data because it’s easy to visually compare values this way. Long data is best for 
some data analysis tasks, like grouping and plotting. 


It will therefore be essential for you to know how to switch from one format to the other 
easily. Switching from the wide to the long format, or the other way around, is called 
pivoting. 


Pivoting wide to long 


To practice pivoting from a wide to a long format, we'll consider data from Gapminder on 
the number of infant deaths in specific countries over several years. 


SIDE NOTE


Gapminder is a good source of rich, health-relevant datasets. You are
encouraged to peruse their collections.


Below, we read in and view this data on infant deaths: 


infant_deaths_wide <- read_csv(here("data/gapminder_infant_deaths.csv"))
infant_deaths_wide


## # A tibble: 5 x 7
##   country              x2010 x2011 x2012 x2013 x2014 x2015
##   <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan          74600 72000 69500 67100 64800 62700
## 2 Angola               79100 76400 73700 71200 69000 67200
## 3 Albania                420   384   354   331   313   301
## 4 United Arab Emirates   683   687   686   681   672   658
## 5 Argentina             9550  9230  8860  8480  8100  7720


We observe that each observational unit (each country) occupies only one row, with the
repeated measurements spread out across multiple columns. Hence this dataset is in a
wide format.


To convert to a long format, we can use a convenient function, pivot_longer(). Within
pivot_longer() we define, using the cols argument, which columns we want to pivot:


infant_deaths_wide %>%
  pivot_longer(cols = x2010:x2015)


## # A tibble: 5 x 3
##   country     name  value
##   <chr>       <chr> <dbl>
## 1 Afghanistan x2010 74600
## 2 Afghanistan x2011 72000
## 3 Afghanistan x2012 69500
## 4 Afghanistan x2013 67100
## 5 Afghanistan x2014 64800

Very easy! 


We can observe that the resulting long format dataset has each country occupying 6
rows (one per year from 2010 to 2015). The years are indicated in the variable
name, and all the death count values occupy a single variable, value.


A useful way to think about this transformation is that the infant deaths values used to be 
in matrix format (2 dimensions; 2D), but they are now in a vector format (1 dimension; 
1D). 
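The matrix-to-vector picture can be sketched in base R. The small matrix below is a hypothetical stand-in built from the first two countries and three years of the table shown earlier; flattening it row by row gives the same order of values that pivot_longer() produces in its value column.

```r
# A 2 x 3 block of values: 2 countries (rows) by 3 years (columns)
deaths_matrix <- matrix(
  c(74600, 72000, 69500,
    79100, 76400, 73700),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Afghanistan", "Angola"), c("x2010", "x2011", "x2012"))
)

# Flatten row by row: all years for Afghanistan, then all years for Angola
deaths_vector <- as.vector(t(deaths_matrix))
deaths_vector
```

In the long dataset, the country and year columns play the role of the row and column labels that the matrix carried implicitly.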


This long dataset will be much more handy for many data analysis procedures. 


As a good data analyst, you may find the default names of the variables, name and
value, to be unsatisfactory; they do not adequately describe what the variables contain.
Not to worry; you can give custom column names, using the arguments names_to and
values_to:


infant_deaths_wide %>%
  pivot_longer(cols = x2010:x2015,
               names_to = "year",
               values_to = "deaths_count")


## # A tibble: 5 x 3
##   country     year  deaths_count
##   <chr>       <chr>        <dbl>
## 1 Afghanistan x2010        74600
## 2 Afghanistan x2011        72000
## 3 Afghanistan x2012        69500
## 4 Afghanistan x2013        67100
## 5 Afghanistan x2014        64800


SIDE NOTE


Notice that the long format is more informative than the original wide
format. Why? Because of the informative column name “deaths_count”.
In the wide format, unless the CSV is named something like
count_infant_deaths, or someone tells you “these are the counts of
infant deaths per country and per year”, you have no idea what the
numbers in the cells represent.


You may also want to remove the x in front of each year. This can be achieved with the
convenient parse_number() function from the {readr} package (part of the tidyverse),
which extracts numbers from strings:


infant_deaths_wide %>%
  pivot_longer(cols = x2010:x2015,
               names_to = "year",
               values_to = "deaths_count") %>%
  mutate(year = parse_number(year))


## # A tibble: 5 x 3
##   country      year deaths_count
##   <chr>       <dbl>        <dbl>
## 1 Afghanistan  2010        74600
## 2 Afghanistan  2011        72000
## 3 Afghanistan  2012        69500
## 4 Afghanistan  2013        67100
## 5 Afghanistan  2014        64800


Great! Now we have a clean, long dataset. 
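As an aside, the year clean-up can also be done inside pivot_longer() itself. The sketch below assumes a reasonably recent version of {tidyr} (the names_transform argument is not available in very old releases) and uses a small mock table, infant_deaths_demo, built from values shown above rather than the real CSV.

```r
library(tidyr)

# Mock wide table with two countries and three years
infant_deaths_demo <- tibble::tribble(
  ~country, ~x2010, ~x2011, ~x2012,
  "Afghanistan", 74600, 72000, 69500,
  "Angola", 79100, 76400, 73700
)

infant_deaths_demo_long <- pivot_longer(
  infant_deaths_demo,
  cols = x2010:x2012,
  names_to = "year",
  names_prefix = "x",                         # strip the leading "x" from each name
  names_transform = list(year = as.integer),  # turn "2010" into the integer 2010
  values_to = "deaths_count"
)

infant_deaths_demo_long
```

Both routes lead to a numeric year column; parse_number() is more forgiving when the column names do not share a single clean prefix.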


For later use, let’s now store this data: 


infant_deaths_long <-
  infant_deaths_wide %>%
  pivot_longer(cols = x2010:x2015,
               names_to = "year",
               values_to = "deaths_count")


PRACTICE (in RMD)


For this practice question, you will use the euro_births_wide dataset
from Eurostat. It shows the annual number of births in 50 European
countries:


euro_births_wide <-
  read_csv(here("data/euro_births_wide.csv"))
head(euro_births_wide)


## # A tibble: 5 x 8
##   country   x2015  x2016  x2017  x2018  x2019  x2020  x2021
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Belgium  122274 121896 119690 118319 117695 114350 118349
## 2 Bulgaria  65950  64984  63955  62197      …  59086      …
## 3 Czechia       … 112663 114405 114086      … 110200      …
## 4 Denmark       …      …      …      …      …  60937  63473
## 5 Germany  737575 792141 784901 787523 778090 773144 795492


The data is in a wide format. Convert it to a long format data frame that
has the following column names: “country”, “year” and “births_count”.


Q_euro_births_long <-
  euro_births_wide %>% # Complete the code with your answer


Pivoting long to wide 


Now you know how to pivot from wide to long with pivot_longer(). How about going
the other way, from long to wide? For this, you can use the fittingly-named
pivot_wider() function.


But before we consider how to use this function to manipulate long data, let’s first 
consider where you're likely to run into long data. 


While wide data tends to come from external sources (as we have seen above), long data,
on the other hand, is likely to be created by you while data wrangling, especially in the
course of group_by()-summarize() manipulations.

Let’s see an example of this now. 

We will use a dataset of patient records from an Ebola outbreak in Sierra Leone in 2014. 


Below we extract this data from the {outbreaks} package and perform some simplifying 
manipulations on it. 




ebola <-
  outbreaks::ebola_sierraleone_2014 %>%
  as_tibble() %>%
  mutate(year = lubridate::year(date_of_onset)) %>% # Extract the year from the date
  select(patient_id = id, district, year_of_onset = year) # Select and rename
ebola


## # A tibble: 5 x 3
##   patient_id district year_of_onset
##        <int> <fct>            <dbl>
## 1          1 Kailahun          2014
## 2          2 Kailahun          2014
## 3          3 Kailahun          2014
## 4          4 Kailahun          2014
## 5          5 Kailahun          2014


Each row corresponds to one patient, and we have each patient’s id number, their district 
and the year in which they contracted Ebola. 


Now, consider the following grouped summary of the ebola dataset, which counts the 
number of patients recorded in each district in each year: 


cases_per_district_per_year <-
  ebola %>%
  group_by(district) %>%
  count(year_of_onset) %>%
  ungroup()


cases_per_district_per_year


## # A tibble: 5 x 3
##   district year_of_onset     n
##   <fct>            <dbl> <int>
## 1 Bo                2014   397
## 2 Bo                2015   209
## 3 Bombali           2014  1070
## 4 Bombali           2015   120
## 5 Bonthe            2014     7


The output of this grouped operation is a quintessentially “long” dataset! Each 
observational unit (each district) occupies multiple rows (two rows per district, to be 
exact), with one row for each measurement (each year). 


So, as you now see, long data often can arrive as an output of grouped summaries, among 
other data manipulations. 


Now, let’s see how to convert such long data into a wide format with pivot_wider().


The code is quite straightforward: 


cases_per_district_per_year %>%
  pivot_wider(values_from = n,
              names_from = year_of_onset)


## # A tibble: 5 x 3
##   district `2014` `2015`
##   <fct>     <int>  <int>
## 1 Bo          397    209
## 2 Bombali    1070    120
## 3 Bonthe        7     77
## 4 Kailahun    535     35
## 5 Kambia      127    294


As you can see, pivot_wider() has two important arguments: values_from and
names_from. The values_from argument defines which values will become the core of
the wide data format (in other words: which 1D vector will become a 2D matrix). In our
case, these values were in the n variable. And names_from identifies which variable to use
to define column names in the wide format. In our case, this was the year_of_onset
variable.
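Here is a self-contained sketch of the same call that does not need the {outbreaks} data. The mock table cases_demo simply retypes the counts for three districts from the output above; the final step pivots back to long to show that the two functions undo each other.

```r
library(tidyr)

# Mock long data: one row per district-year combination
cases_demo <- tibble::tribble(
  ~district, ~year_of_onset, ~n,
  "Bo",      2014, 397,
  "Bo",      2015, 209,
  "Bombali", 2014, 1070,
  "Bombali", 2015, 120,
  "Bonthe",  2014, 7,
  "Bonthe",  2015, 77
)

# Long to wide: years become column names, counts fill the cells
cases_demo_wide <- pivot_wider(cases_demo,
                               names_from = year_of_onset,
                               values_from = n)

# Wide back to long: the year comes back as a character column
cases_demo_back <- pivot_longer(cases_demo_wide,
                                cols = -district,
                                names_to = "year_of_onset",
                                values_to = "n")

cases_demo_wide
cases_demo_back
```

Note that after the round trip year_of_onset is stored as text, because column names are always strings; it would need converting if numeric years are required.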


SIDE NOTE


You might also want to have the years be your primary
observational/statistical unit, with each year occupying one row. This can
be carried out similarly to the above example, but the district variable
will be provided as an argument to names_from, instead of
year_of_onset.


cases_per_district_per_year %>%
  pivot_wider(values_from = n,
              names_from = district)


## # A tibble: 2 x 15
##   year_of_onset    Bo Bombali Bonthe Kailahun Kambia Kenema
##           <dbl> <int>   <int>  <int>    <int>  <int>  <int>
## 1          2014   397    1070      7      535    127    641
## 2          2015   209     120     77       35    294    139
## # ... with 8 more variables: Koinadugu <int>, Kono <int>,
## #   Moyamba <int>, `Port Loko` <int>, Pujehun <int>, ...


Here the unique observation units (our rows) are now the years (2014,
2015).


PRACTICE (in RMD)


The population dataset from the tidyr package shows the populations
of 219 countries over time.


Pivot this data into a wide format. Your answer should have 20 columns
and 219 rows.


Q_population_widen <-
  tidyr::population


Why is long data better for analysis? 


Above we mentioned that long data is best for a majority of data analysis tasks. Now we
can justify why. In the sections below, we will go through a few common operations that
you will need to do with long data. In each case, you will observe that similar
manipulations on wide data would be quite tricky.


Filtering grouped data 


First, let’s talk about filtering grouped data, which is very easy to do on long data, but 
difficult on wide data. 


Here is an example with the infant deaths dataset. Imagine that we want to answer the
following question: For each country, which year had the highest number of infant
deaths?


This is how we would do so with the long format of the data:


infant_deaths_long %>%
  group_by(country) %>%
  filter(deaths_count == max(deaths_count))


## # A tibble: 5 x 3
## # Groups:   country [5]
##   country              year  deaths_count
##   <chr>                <chr>        <dbl>
## 1 Afghanistan          x2010        74600
## 2 Angola               x2010        79100
## 3 Albania              x2010          420
## 4 United Arab Emirates x2011          687
## 5 Argentina            x2010         9550


Easy, right? We can easily see, for example, that Afghanistan had its highest infant death
count in 2010, and the United Arab Emirates had its highest death count in 2011.


If you wanted to do the same thing with wide data, it would be much more difficult. You
could try an approach like this with rowwise():


infant_deaths_wide %>%
  rowwise() %>%
  mutate(max_count = max(x2010, x2011, x2012, x2013, x2014, x2015))


## # A tibble: 5 x 8
## # Rowwise:
##   country              x2010 x2011 x2012 x2013 x2014 x2015
##   <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan          74600 72000 69500 67100 64800 62700
## 2 Angola               79100 76400 73700 71200 69000 67200
## 3 Albania                420   384   354   331   313   301
## 4 United Arab Emirates   683   687   686   681   672   658
## 5 Argentina             9550  9230  8860  8480  8100  7720
## # ... with 1 more variable: max_count <dbl>


This almost works: for each country, we have the maximum number of infant deaths
reported. But we still don’t know which year is attached to that value in
max_count. We would have to take that value and index it back to its respective year
column somehow... what a hassle! There are solutions to find this but all are very painful.
Why make your life complicated when you can just pivot to long format and use the
beauty of group_by() and filter()?
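The pivot-then-filter route can be sketched end to end on a small mock table. The object names deaths_wide_demo and peak_years are hypothetical; the values are copied from two rows and three years of the wide table above.

```r
library(dplyr)
library(tidyr)

# Mock wide table: two countries, three years
deaths_wide_demo <- tibble::tribble(
  ~country, ~x2010, ~x2011, ~x2012,
  "Albania", 420, 384, 354,
  "United Arab Emirates", 683, 687, 686
)

peak_years <- deaths_wide_demo %>%
  # Step 1: wide to long, so that year becomes a variable of its own
  pivot_longer(cols = x2010:x2012,
               names_to = "year",
               values_to = "deaths_count") %>%
  # Step 2: within each country, keep the row(s) holding the maximum
  group_by(country) %>%
  filter(deaths_count == max(deaths_count)) %>%
  ungroup()

peak_years
```

Because the year travels with each value in the long format, the answer to "which year?" falls out of the filter with no extra bookkeeping.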


SIDE NOTE


Here we used a special {dplyr} function: rowwise(). rowwise() allows
further operations to be applied per row. It is equivalent to creating one
group for each row (group_by(row_number())).


Without rowwise() you would get this:


infant_deaths_wide %>%
  mutate(max_count = max(x2010, x2011, x2012, x2013, x2014, x2015))


## # A tibble: 5 x 8
##   country              x2010 x2011 x2012 x2013 x2014 x2015
##   <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan          74600 72000 69500 67100 64800 62700
## 2 Angola               79100 76400 73700 71200 69000 67200
## 3 Albania                420   384   354   331   313   301
## 4 United Arab Emirates   683   687   686   681   672   658
## 5 Argentina             9550  9230  8860  8480  8100  7720
## # ... with 1 more variable: max_count <dbl>


...the maximum count over ALL rows in the dataset.


PRACTICE (in RMD)


For this practice question, you will perform a grouped filter on the long
format population dataset from the tidyr package. Use group_by()
and filter() to obtain a dataset that shows the maximum population
recorded for each country, and the year in which that maximum
population was recorded.


Q_population_max <-
  population


Summarizing grouped data 


Grouped summaries are also difficult to perform on wide data. For example, considering
again the infant_deaths_long dataset, if you want to ask: For each country, what was
the mean number of infant deaths and the standard deviation (variation) in deaths?


With long data it is simple: 


infant_deaths_long %>%
  group_by(country) %>%
  summarize(mean_deaths = mean(deaths_count),
            sd_deaths = sd(deaths_count))


## # A tibble: 5 x 3
##   country             mean_deaths sd_deaths
##   <chr>                     <dbl>     <dbl>
## 1 Afghanistan              68450   4466.
## 2 Albania                    350.    45.2
## 3 Algeria                  21033.   484.
## 4 Angola                   72767.  4513.
## 5 Antigua and Barbuda          …      0.816


With wide data, on the other hand, finding the mean is less intuitive...


infant_deaths_wide %>%
  rowwise() %>%
  mutate(mean_deaths = sum(x2010, x2011, x2012,
                           x2013, x2014, x2015, na.rm = T)/6)


## # A tibble: 5 x 8
## # Rowwise:
##   country              x2010 x2011 x2012 x2013 x2014 x2015
##   <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan          74600 72000 69500 67100 64800 62700
## 2 Angola               79100 76400 73700 71200 69000 67200
## 3 Albania                420   384   354   331   313   301
## 4 United Arab Emirates   683   687   686   681   672   658
## 5 Argentina             9550  9230  8860  8480  8100  7720
## # ... with 1 more variable: mean_deaths <dbl>


And finding the standard deviation would be even more awkward, because sd() expects a
single vector of values rather than separate columns passed as individual arguments.
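When the data arrive in wide format, the simplest route to both statistics is to pivot first and summarize second. The sketch below uses a hypothetical two-country table, deaths_wide_demo, typed in from the wide output above.

```r
library(dplyr)
library(tidyr)

# Mock wide table: two countries, six years
deaths_wide_demo <- tibble::tribble(
  ~country, ~x2010, ~x2011, ~x2012, ~x2013, ~x2014, ~x2015,
  "Afghanistan", 74600, 72000, 69500, 67100, 64800, 62700,
  "Albania",       420,   384,   354,   331,   313,   301
)

deaths_summary <- deaths_wide_demo %>%
  # All yearly counts end up in one column, which is the shape sd() needs
  pivot_longer(cols = x2010:x2015,
               names_to = "year",
               values_to = "deaths_count") %>%
  group_by(country) %>%
  summarize(mean_deaths = mean(deaths_count),
            sd_deaths = sd(deaths_count))

deaths_summary
```

The same pattern works for any summary function that takes a single vector, such as median() or max().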


PRACTICE (in RMD)


For this practice question, you will again work with the long format
population dataset from the tidyr package.


Use group_by() and summarize() to obtain, for each country, the
maximum reported population, the minimum reported population, and
the mean reported population across the years available in the data. Your
data should have four columns: “country”, “max_population”,
“min_population” and “mean_population”.


Q_population_summaries <-
  population


Plotting 


Finally, one of the data analysis tasks that is MOST hindered by wide formats is plotting.
You may not yet have any prior knowledge of {ggplot2} and how to plot, so we will see the
figures without going in depth with the code. What you need to remember is: many plots
with ggplot are only possible with long-format data.


Consider again the infant deaths data, infant_deaths_long. We will plot the number of
deaths for Belgium per year:


infant_deaths_long %>%
  filter(country == "Belgium") %>%
  ggplot() +
  geom_col(aes(x = year, y = deaths_count))


[Figure: bar chart of deaths_count (y-axis, 0 to 400) by year (x-axis, x2010 to x2015) for Belgium]


The plotting works because we can give the variable year for the x-axis. In the long
format, year is a variable of its own. In the wide format, there would be no
such variable to pass to the x-axis.


Another plot that would not be possible without a long format: 


infant_deaths_long %>%
  head(30) %>%
  ggplot(aes(x = year, y = deaths_count, group = country, color = country)) +
  geom_line() +
  geom_point()


[Figure: line chart with points of deaths_count (y-axis, 0 to 80000) by year (x-axis, x2010 to x2015), one colored line per country: Afghanistan, Albania, Angola, Argentina, United Arab Emirates]


Once again, the reason is the same: we need to tell the plot what to use as an x-axis and a
y-axis, and it is necessary to have these variables in their own columns (as organized in the
long format).
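For readers who want something runnable, here is a minimal plotting sketch that does not depend on the course data files. The mock table and the object names (deaths_wide_demo, deaths_long_demo, deaths_plot) are made up for illustration, and the year is converted to an integer while pivoting so that it behaves as a numeric axis.

```r
library(ggplot2)
library(tidyr)

# Mock wide table: two countries, three years
deaths_wide_demo <- tibble::tribble(
  ~country, ~x2010, ~x2011, ~x2012,
  "Albania", 420, 384, 354,
  "United Arab Emirates", 683, 687, 686
)

# Pivot so that year and deaths_count are columns of their own
deaths_long_demo <- pivot_longer(
  deaths_wide_demo,
  cols = x2010:x2012,
  names_to = "year",
  names_prefix = "x",
  names_transform = list(year = as.integer),
  values_to = "deaths_count"
)

# Each aesthetic is mapped to exactly one column of the long data
deaths_plot <- ggplot(deaths_long_demo,
                      aes(x = year, y = deaths_count, color = country)) +
  geom_line() +
  geom_point()

deaths_plot
```

Every aesthetic inside aes() points at one column, which is why the long layout fits ggplot so naturally.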


Pivoting can be hard 


We have mostly looked at very simple examples of pivoting here, but in the wild, pivoting 
can be very difficult to do accurately. This is because the data you are working with may 
not have all the information necessary for a successful pivot, or the data may contain 
errors that prevent you from pivoting correctly. 
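One common obstacle is duplicated rows: if the same district-year combination appears twice, pivot_wider() cannot place a single value in each cell. The sketch below uses a hypothetical table, cases_dup, in which one record was entered twice, and shows one way to detect and remove such duplicates before pivoting. Whether dropping duplicates is the right fix depends on the dataset; here it is assumed that the repeated row is a data entry error.

```r
library(dplyr)
library(tidyr)

# Mock long data in which the Bo 2014 record appears twice
cases_dup <- tibble::tribble(
  ~district, ~year_of_onset, ~n,
  "Bo",      2014, 397,
  "Bo",      2014, 397,
  "Bo",      2015, 209,
  "Bombali", 2014, 1070,
  "Bombali", 2015, 120
)

# Step 1: list the district-year combinations that occur more than once
duplicated_keys <- cases_dup %>%
  count(district, year_of_onset, name = "n_rows") %>%
  filter(n_rows > 1)

# Step 2: drop exact duplicate rows, then pivot as usual
cases_clean_wide <- cases_dup %>%
  distinct() %>%
  pivot_wider(names_from = year_of_onset, values_from = n)

duplicated_keys
cases_clean_wide
```

Checking for duplicated keys before pivoting is a cheap habit that catches many real-world pivoting failures early.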


When you run into such cases, we recommend looking at the official documentation of 
pivoting from the tidyr team, as it is quite rich in examples. You could also post your 
questions about pivoting on forums like Stack Overflow. 


Wrap Up! 


You have now explored different datasets and how they are either in a long or wide
format. In the end, it’s just about how you present the information. Sometimes one
format will be more convenient, and other times another could be best. Now, you are no
longer limited by the format of your data: don't like it? Change it!


Contributors 


The following team members contributed to this lesson: 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 
Passionate about world improvement 


LAURE VANCAUWENBERGHE 


Data analyst, the GRAPH Network
A firm believer in science for good, striving to ally programming, health 
and education 


Lesson notes | Advanced pivoting 


Created by the GRAPH Courses team 


January 2023 


This document serves as an accompaniment for a lesson found on
https://thegraphcourses.org.


The GRAPH Courses is a project of the Global Research and Analyses for Public Health 
(GRAPH) Network, a non-profit headquartered at the University of Geneva Global Health 
Institute, and supported by the World Health Organization (WHO) and other partners 


Intro
Learning Objectives
Packages
Datasets
Wide to long
Understanding names_sep and “.value”
Value type before the separator
A non-time-series example
Escaping the dot separator
What to do when you don't have a neat separator?
Long to wide
Wrap up


Intro 


You know basic pivoting operations from long format datasets to wide format datasets
and vice versa. However, as is often the case, basic manipulations are sometimes not
enough for the wrangling you need to do. Let’s now see the next level. Let's go!


Learning Objectives 
1. Master complex pivoting from wide to long and long to wide 


2. Know how to use separators as a pivoting tool 


Packages 


# Load packages 
if (!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, outbreaks, janitor, rio, here, knitr)


Datasets 


We will introduce these datasets as we go along but here is an overview: 


• Survey data from India on how much money patients spent on tuberculosis
treatment


• Biomarker data from an enteropathogen study in Zambia


• A diet survey from Vietnam


Wide to long 


Sometimes you have multiple kinds of wide data in the same table. Consider this artificial 
example of heights and weights for children over two years: 


child_stats <-
  tibble::tribble(
    ~child, ~year1_height, ~year2_height, ~year1_weight, ~year2_weight,
    "A", "80cm", "85cm", "5kg", "10kg",
    "B", "85cm", "90cm", "7kg", "12kg",
    "C", "90cm", "100cm", "6kg", "14kg"
  )


child_stats


## # A tibble: 3 x 5
##   child year1_height year2_height year1_weight year2_weight
##   <chr> <chr>        <chr>        <chr>        <chr>
## 1 A     80cm         85cm         5kg          10kg
## 2 B     85cm         90cm         7kg          12kg
## 3 C     90cm         100cm        6kg          14kg


If you pivot all the measurement columns, you'll get overly long data: 


child_stats %>%
  pivot_longer(2:5)


## # A tibble: 5 x 3
##   child name         value
##   <chr> <chr>        <chr>
## 1 A     year1_height 80cm
## 2 A     year2_height 85cm
## 3 A     year1_weight 5kg
## 4 A     year2_weight 10kg
## 5 B     year1_height 85cm


This is not what you (usually) want, because now you have two different kinds of data in 
the same column—weight and height. 


To get the right shape, you'll need to use the names_sep argument and the “.value”
identifier:


child_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("period", ".value"))


## # A tibble: 5 x 4
##   child period height weight
##   <chr> <chr>  <chr>  <chr>
## 1 A     year1  80cm   5kg
## 2 A     year2  85cm   10kg
## 3 B     year1  85cm   7kg
## 4 B     year2  90cm   12kg
## 5 C     year1  90cm   6kg


Now we have one row for each child-period, an appropriately long format! 


What the code above is doing may not be clear, but you should already be able to answer
the practice question below by pattern matching with our example. After the practice
question, we will explain the names_sep argument and the “.value” identifier in more
depth.


PRACTICE (in RMD)


Consider this other artificial data set:


adult_stats <-
  tibble::tribble(
    ~adult, ~year1_BMI, ~year2_BMI, ~year1_HIV, ~year2_HIV,
    "A", 25, 30, "Positive", "Positive",
    "B", 34, 28, "Negative", "Positive",
    "C", 19, 17, "Negative", "Negative"
  )


adult_stats


## # A tibble: 3 x 5
##   adult year1_BMI year2_BMI year1_HIV year2_HIV
##   <chr>     <dbl>     <dbl> <chr>     <chr>
## 1 A            25        30 Positive  Positive
## 2 B            34        28 Negative  Positive
## 3 C            19        17 Negative  Negative


Pivot the data into a long format to get the following structure:


adult year BMI HIV


# Q_adult_long <-
#   adult_stats %>%
#   pivot_longer(
#   )


SIDE NOTE


The child_stats example above has numbers stored as characters [...]


As you saw in the previous lesson, you can easily extract the numbers
from the output long data frame in our example using the
parse_number() function from readr:


child_stats_long <-
  child_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("period", ".value"))


child_stats_long


## # A tibble: 5 x 4
##   child period height weight
##   <chr> <chr>  <chr>  <chr>
## 1 A     year1  80cm   5kg
## 2 A     year2  85cm   10kg
## 3 B     year1  85cm   7kg
## 4 B     year2  90cm   12kg
## 5 C     year1  90cm   6kg


child_stats_long %>%
  mutate(height = parse_number(height),
         weight = parse_number(weight))


## # A tibble: 5 x 4
##   child period height weight
##   <chr> <chr>   <dbl>  <dbl>
## 1 A     year1      80      5
## 2 A     year2      85     10
## 3 B     year1      85      7
## 4 B     year2      90     12
## 5 C     year1      90      6


Understanding names_sep and “.value” 


Now let’s break down the pivot_longer() call we saw above a bit more:


child_stats


## # A tibble: 3 x 5
##   child year1_height year2_height year1_weight year2_weight
##   <chr> <chr>        <chr>        <chr>        <chr>
## 1 A     80cm         85cm         5kg          10kg
## 2 B     85cm         90cm         7kg          12kg
## 3 C     90cm         100cm        6kg          14kg


child_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("period", ".value"))


## # A tibble: 5 x 4
##   child period height weight
##   <chr> <chr>  <chr>  <chr>
## 1 A     year1  80cm   5kg
## 2 A     year2  85cm   10kg
## 3 B     year1  85cm   7kg
## 4 B     year2  90cm   12kg
## 5 C     year1  90cm   6kg


Notice that the column names in the original child_stats data frame (year1_height,
year2_height and so on) are made of three parts:


• the period being referenced, e.g. “year1”


• an underscore separator, “_”


• and the type of value recorded, “height” or “weight”


We can make a table with these parts: 


column_name    period   separator   “.value”
year1_height   year1    _           height
year2_height   year2    _           height
year1_weight   year1    _           weight
year2_weight   year2    _           weight


Based on that table, it should now be easier to understand the names_sep and names_to
arguments that we supplied to pivot_longer():


names_sep = "_"


This is the separator between the period indicator (year) and the values (height and
weight) recorded.


If we had a different separator, this argument would change. For example, if the
separator were a space, " ", you would have names_sep = " ", as seen in the
example below:


child_stats_space_sep <-
  tibble::tribble(
    ~child, ~`yr1 height`, ~`yr2 height`, ~`yr1 weight`, ~`yr2 weight`,
    "A", "80cm", "85cm", "5kg", "10kg",
    "B", "85cm", "90cm", "7kg", "12kg",
    "C", "90cm", "100cm", "6kg", "14kg"
  )


child_stats_space_sep %>%
  pivot_longer(2:5,
               names_sep = " ",
               names_to = c("period", ".value"))


## # A tibble: 5 x 4
##   child period height weight
##   <chr> <chr>  <chr>  <chr>
## 1 A     yr1    80cm   5kg
## 2 A     yr2    85cm   10kg
## 3 B     yr1    85cm   7kg
## 4 B     yr2    90cm   12kg
## 5 C     yr1    90cm   6kg


names_to = c("period", ".value")


Next, the names_to argument indicates how the data should be reshaped. We passed a
vector of two character strings, “period” and “.value”, to this argument. Let’s consider
each in turn:


The “period” string indicated that we want to move the data from each year (or period)
into a separate row. Note that there is nothing special about the word “period” used here;
we could change this to any other string. So instead of “period”, you could have written
“time” or “year_of_measurement” or anything else:


child_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("year_of_measurement", ".value"))


# A tibble: 5 x 4
  child year_of_measurement height weight
  <chr> <chr>               <chr>  <chr>
1 A     year1               80cm   5kg
2 A     year2               85cm   10kg
3 B     year1               85cm   7kg
4 B     year2               90cm   12kg
5 C     year1               90cm   6kg


Now, the ".value" placeholder is a special indicator that tells pivot_longer() to make
a separate column for every distinct value that appears after the separator. In our
example, these distinct values are "height" and "weight".


The “.value” string cannot be arbitrarily replaced. For example, this won't work: 


child_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("period", "values"))


# A tibble: 5 x 4
  child period values value
  <chr> <chr>  <chr>  <chr>
1 A     year1  height 80cm
2 A     year2  height 85cm
3 A     year1  weight 5kg
4 A     year2  weight 10kg
5 B     year1  height 85cm


To restate the point, the ".value" placeholder tells pivot_longer() that we want to
separate out the "height" and "weight" values into separate columns, because these are
the two value types that occur after the "_" separator in the column names.
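To make the role of ".value" concrete, the sketch below compares the one-step pivot with a two-step equivalent: first pivot everything long, then spread the value types back out. The child_stats data frame is rebuilt here from the tables printed in this lesson, and the names value_type and measurement are arbitrary labels chosen for illustration.

```r
library(tidyr)   # pivot_longer(), pivot_wider(), and the re-exported %>% pipe
library(tibble)  # tribble()

# child_stats rebuilt from the printed tables in this lesson
child_stats <- tribble(
  ~child, ~year1_height, ~year2_height, ~year1_weight, ~year2_weight,
  "A", "80cm", "85cm",  "5kg", "10kg",
  "B", "85cm", "90cm",  "7kg", "12kg",
  "C", "90cm", "100cm", "6kg", "14kg"
)

# One step: ".value" sends each value type to its own column
one_step <- child_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("period", ".value"))

# Two steps: pivot fully long, then widen by value type
two_steps <- child_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("period", "value_type"),  # arbitrary column name
               values_to = "measurement") %>%         # arbitrary column name
  pivot_wider(names_from = value_type,
              values_from = measurement)

# Compare the two results
all.equal(one_step, two_steps)
```

Under these assumptions the two results should match, which suggests one way to think about ".value": it folds the widening step into the same call.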


This means that if you had a wide dataset with three types of values, you would get 
separated-out columns, one for each value type. For example, consider the mock dataset 
below which shows children’s records, at two time points, for the following variables: 


• age in months,
• body fat %
• bmi


child_stats_three_values <-
  tibble::tribble(
    ~child, ~t1_age, ~t2_age, ~t1_fat, ~t2_fat, ~t1_bmi, ~t2_bmi,
    "a", "5mths", "8mths", "13%", "15%", 14, 15,
    "b", "7mths", "9mths", "15%", "17%", 16, 18
  )

child_stats_three_values


## # A tibble: 2 x 7
##   child t1_age t2_age t1_fat t2_fat t1_bmi t2_bmi
##   <chr> <chr>  <chr>  <chr>  <chr>   <dbl>  <dbl>
## 1 a     5mths  8mths  13%    15%        14     15
## 2 b     7mths  9mths  15%    17%        16     18


Here, in the column names there are three value types occurring after the "_" separator:
age, fat and bmi; the ".value" string tells pivot_longer() to make a new column for
each value type:


child_stats_three_values %>%
  pivot_longer(2:7,
               names_sep = "_",
               names_to = c("time", ".value")
  )


## # A tibble: 4 x 5
##   child time  age   fat     bmi
##   <chr> <chr> <chr> <chr> <dbl>
## 1 a     t1    5mths 13%      14
## 2 a     t2    8mths 15%      15
## 3 b     t1    7mths 15%      16
## 4 b     t2    9mths 17%      18


PRACTICE (in RMD)

A pediatrician records the following information for a set of children over
two years:

• head circumference;
• neck circumference; and
• hip circumference

all in centimeters.

The output table resembles the below:


growth_stats <-
  tibble::tribble(
    ~child, ~yr1_head, ~yr2_head, ~yr1_neck, ~yr2_neck, ~yr1_hip, ~yr2_hip,
    "a", 45, 48, 23, 24, 51, 52,
    "b", 48, 50, 24, 26, 52, 52,
    "c", 50, 52, 24, 27, 53, 54
  )

growth_stats


## # A tibble: 3 x 7
##   child yr1_head yr2_head yr1_neck yr2_neck yr1_hip yr2_hip
##   <chr>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
## 1 a           45       48       23       24      51      52
## 2 b           48       50       24       26      52      52
## 3 c           50       52       24       27      53      54


Pivot the data into a long format to get the following structure:

child   year   head   neck   hip

# Q_growth_stats_long <-
#   growth_stats %>%
#   pivot_longer( )

growth_stats %>%
  pivot_longer(2:7,
               names_sep = "_",
               names_to = c("year", ".value"))


Value type before the separator 


In all the examples we have used so far, the column names were constructed such that
the value type came after the separator. Recall our table:

column_name    period   separator   ".value"
year1_height   year1    _           height
year2_height   year2    _           height
year1_weight   year1    _           weight
year2_weight   year2    _           weight


But of course, the column names could be constructed differently, with the value types 
coming before the separator, as in this example: 


child_stats2 <-
  tibble::tribble(
    ~child, ~height_year1, ~height_year2, ~weight_year1, ~weight_year2,
    "A", "80cm", "85cm", "5kg", "10kg",
    "B", "85cm", "90cm", "7kg", "12kg",
    "C", "90cm", "100cm", "6kg", "14kg"
  )

child_stats2


## # A tibble: 3 x 5
##   child height_year1 height_year2 weight_year1 weight_year2
##   <chr> <chr>        <chr>        <chr>        <chr>
## 1 A     80cm         85cm         5kg          10kg
## 2 B     85cm         90cm         7kg          12kg
## 3 C     90cm         100cm        6kg          14kg


Here, the value types (height and weight) come before the separator. 


How can our pivot_longer() command accommodate this? Simple! Just swap the order
of the vector given to the names_to argument:

So instead of names_to = c("time", ".value"), you would have names_to =
c(".value", "time"):


child_stats2 %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c(".value", "time"))


# A tibble: 5 x 4
  child time  height weight
  <chr> <chr> <chr>  <chr>
1 A     year1 80cm   5kg
2 A     year2 85cm   10kg
3 B     year1 85cm   7kg
4 B     year2 90cm   12kg
5 C     year1 90cm   6kg

And that’s it! 
PRACTICE (in RMD)

Consider the following data set from Zambia about enteropathogens and
their biomarkers.


enteropathogens_zambia_wide <-
  read_csv(here("data/enteropathogens_zambia_wide.csv"))

## Rows: 297 Columns: 7
## ── Column specification ──
## Delimiter: ","
## dbl (7): ID, LPS_1, LPS_2, LBP_1, LBP_2, IFABP_1, IFABP_2
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

enteropathogens_zambia_wide


## # A tibble: 5 x 7
##      ID LPS_1 LPS_2  LBP_1 LBP_2 IFABP_1 IFABP_2
##   <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>
## 1  1002  222.  390. 38414. 6840.  1294.     610.
## 2  1003  181.   NA  26888.   NA     22.5     NA
## 3  1004  257.  221. 49183. 5426.     0        0
## 4  1005   NA   369.    NA  1938.     0     1010.
## 5  1006  275.   NA  61758.   NA      0       NA


This data frame has the following columns:

• LPS_1 and LPS_2: lipopolysaccharide levels, measured by Pyrochrome LAL, in EU/mL
• LBP_1 and LBP_2: LPS binding protein levels, in pg/mL
• IFABP_1 and IFABP_2: intestinal-type fatty acid binding protein levels, in pg/mL

Pivot the dataset so that it resembles the following structure:


enteropathogens_zambia_long <-
  enteropathogens_zambia_wide %>%
  pivot_longer(!ID,
               names_to = c(".value", "sample_count"),
               names_sep = "_")


# A tibble: 5 x 5
     ID sample_count   LPS    LBP  IFABP
  <dbl> <chr>        <dbl>  <dbl>  <dbl>
1  1002 1             222. 38414. 1294.
2  1002 2             390.  6840.  610.
3  1003 1             181. 26888.   22.5
4  1003 2              NA     NA     NA
5  1004 1             257. 49183.    0

A non-time-series example 


So far we have been using person-period (time series) datasets to illustrate the idea of 
complex pivots with multiple value types. 


But as we have mentioned, not all reshape-requiring datasets are time series data. Let's 
see a quick non-time-series example [...] 


You might measure the height (cm) and weight (kg) of a series of parental couples in a
table like this:


family_stats <-
  tibble::tribble(
    ~couple, ~father_height, ~father_weight, ~mother_height, ~mother_weight,
    "a", 180, 80, 160, 70,
    "b", 185, 90, 150, 76,
    "c", 182, 93, 143, 78
  )

family_stats


# A tibble: 3 x 5
  couple father_height father_weight mother_height
  <chr>          <dbl>         <dbl>         <dbl>
1 a                180            80           160
2 b                185            90           150
3 c                182            93           143
# ... with 1 more variable: mother_weight <dbl>


Here we have two different types of values (weight and height) for each person in the
couple.

To pivot this to one row per person, we'll again need the names_sep and names_to
arguments:


family_stats %>%
  pivot_longer(2:5,
               names_sep = "_",
               names_to = c("person", ".value"))


# A tibble: 5 x 4
  couple person height weight
  <chr>  <chr>   <dbl>  <dbl>
1 a      father    180     80
2 a      mother    160     70
3 b      father    185     90
4 b      mother    150     76
5 c      father    182     93


The separator is an underscore, "_", so we used names_sep = "_" and because the value
types come after the separator, the ".value" identifier was placed second in the names_to
argument.


Escaping the dot separator 


A special example may crop up when you try to pivot a dataset where the separator is a 
period. 


child_stats_dot_sep <-
  tibble::tribble(
    ~child, ~year1.height, ~year2.height, ~year1.weight, ~year2.weight,
    "A", "80cm", "85cm", "5kg", "10kg",
    "B", "85cm", "90cm", "7kg", "12kg",
    "C", "90cm", "100cm", "6kg", "14kg"
  )

child_stats_dot_sep %>%
  pivot_longer(2:5,
               names_to = c("period", ".value"),
               names_sep = "\\.")


# A tibble: 5 x 4
  child period height weight
  <chr> <chr>  <chr>  <chr>
1 A     year1  80cm   5kg
2 A     year2  85cm   10kg
3 B     year1  85cm   7kg
4 B     year2  90cm   12kg
5 C     year1  90cm   6kg


There we used the string "\\." to indicate a dot "." because names_sep is read as a
regular expression, in which "." is a special character that needs to be escaped to be
matched literally.
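The effect of escaping can be seen outside of pivot_longer() with base R's strsplit(), which also treats its split argument as a regular expression by default. This is only an illustrative sketch; the variable names are made up for the example.

```r
column_name <- "year1.height"

# Unescaped: "." matches ANY character, so every character is a separator
unescaped <- strsplit(column_name, ".")[[1]]

# Escaped: "\\." matches a literal dot only
escaped <- strsplit(column_name, "\\.")[[1]]

# Alternative: switch regular expressions off entirely with fixed = TRUE
literal <- strsplit(column_name, ".", fixed = TRUE)[[1]]

print(unescaped)  # only empty strings remain
print(escaped)    # the two parts we actually want
print(literal)    # same two parts as the escaped version
```

Only the escaped and fixed versions recover "year1" and "height" as separate pieces, which is why the dot has to be escaped in names_sep.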


PRACTICE (in RMD)

Consider again the adult_stats data you saw above. Now the column
names have been changed slightly.

adult_stats_dot_sep <-
  tibble::tribble(
    ~adult, ~BMI.year1, ~BMI.year2, ~HIV.year1, ~HIV.year2,
    "A", 25, 30, "Positive", "Positive",
    "B", 34, 28, "Negative", "Positive",
    "C", 19, 17, "Negative", "Negative"
  )

adult_stats_dot_sep


## # A tibble: 3 x 5
##   adult BMI.year1 BMI.year2 HIV.year1 HIV.year2
##   <chr>     <dbl>     <dbl> <chr>     <chr>
## 1 A            25        30 Positive  Positive
## 2 B            34        28 Negative  Positive
## 3 C            19        17 Negative  Negative


Again, pivot the data into a long format to get the following structure:

adult   year   BMI   HIV

# Q_adult2_long <-
#   adult_stats_dot_sep %>%
#   pivot_longer( )

adult_stats_dot_sep %>%
  pivot_longer(2:5,
               names_sep = "\\.",
               names_to = c(".value", "year"))


What to do when you don't have a neat separator?

Sometimes you do not have a neat separator.


Consider this survey data from India that looked at how much money patients spent on 
tuberculosis treatment: 


tb_visits <- read_csv(here("data/india_tb_pathways_and_costs_data.csv")) %>%
  clean_names() %>%
  select(id, first_visit_location, first_visit_cost, second_visit_location,
         second_visit_cost, third_visit_location, third_visit_cost)


Rows: 880 Columns: 22
── Column specification ──
Delimiter: ","
chr (10): Sex, Education, Employment, Alcohol, Smoking, Form of TB, Ch...
dbl (12): id, Age, Wtinkgs, HtinCms, bmi, Diabetes, first visit cost, ...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

tb_visits


# A tibble: 5 x 7
      id first_visit_location first_visit_cost
   <dbl> <chr>                           <dbl>
1 100202 GH                                  0
2 100396 Pvt. docto                       1500
3 100590 Pvt. docto                       2000
4 100687 Pvt. hospi                      20000
5 100784 Pvt. docto                       1000
# ... with 4 more variables: second_visit_location <chr>,
#   second_visit_cost <dbl>, third_visit_location <chr>,


It does not have a neat separator between the time indicators (first, second, third) and 
the value type (cost, location). That is, rather than something like “firstvisit_location”, we 
have instead “first_visit_location”, so the underscore is used for two purposes. For this 
reason, if you try our usual pivot strategy, you will get an error: 


tb_visits %>%
  pivot_longer(2:7,
               names_to = c("visit_count", ".value"),
               names_sep = "_")


Error in `pivot_longer_spec()`:
! Can't combine `first_visit_location` <character> and `first_visit_cost` <double>.
Run `rlang::last_error()` to see where the error occurred.


The most direct way to reshape this dataset successfully would be to use special “regex” 
(string manipulation), but you likely have not learned this yet! 
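For readers curious about the regex route, here is a minimal sketch using the names_pattern argument of pivot_longer(), which takes a regular expression with one capture group per entry in names_to. The tb_toy data frame and its values are hypothetical stand-ins for tb_visits, and the pattern is just one possible way to write it.

```r
library(tidyr)   # pivot_longer() and the re-exported %>% pipe
library(tibble)  # tribble()

# Hypothetical toy data with the same naming scheme as tb_visits
tb_toy <- tribble(
  ~id, ~first_visit_location, ~first_visit_cost, ~second_visit_location, ~second_visit_cost,
  1,   "GH",                  0,                 "Pvt. clinic",          500,
  2,   "Pvt. doctor",         1500,              "GH",                   0
)

tb_toy_long <- tb_toy %>%
  pivot_longer(
    !id,
    names_to = c("visit_count", ".value"),
    # group 1: letters before the first underscore ("first", "second")
    # group 2: everything after it ("visit_location", "visit_cost")
    names_pattern = "^([a-z]+)_(visit_.+)$"
  )

tb_toy_long
```

Because the pattern captures the pieces directly, the columns do not have to be renamed first. The renaming approach in this lesson reaches the same shape without any regex knowledge.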


So for now, the solution we recommend is to manually rename your columns to insert a
clear separator, "__":


tb_visits_renamed <-
  tb_visits %>%
  rename(first__visit_location = first_visit_location,
         first__visit_cost = first_visit_cost,
         second__visit_location = second_visit_location,
         second__visit_cost = second_visit_cost,
         third__visit_location = third_visit_location,
         third__visit_cost = third_visit_cost)

tb_visits_renamed

# A tibble: 5 x 7
      id first__visit_location first__visit_cost
   <dbl> <chr>                             <dbl>
1 100202 GH                                    0
2 100396 Pvt. docto                         1500
3 100590 Pvt. docto                         2000
4 100687 Pvt. hospi                        20000
5 100784 Pvt. docto                         1000
# ... with 4 more variables: second__visit_location <chr>,
#   second__visit_cost <dbl>,

Now we can try the pivot:

tb_visits_long <-
  tb_visits_renamed %>%
  pivot_longer(2:7,
               names_to = c("visit_count", ".value"),
               names_sep = "__")

tb_visits_long

# A tibble: 5 x 4
      id visit_count visit_location
   <dbl> <chr>       <chr>
1 100202 first       GH
2 100202 second      <NA>
3 100202 third       <NA>
4 100396 first       Pvt. docto
5 100396 second      Pvt. clini
# ... with 1 more variable: visit_cost <dbl>

Now let's polish the data frame:


tb_visits_long %>%
  # remove nonexistent entries
  filter(visit_location != "") %>%
  # give significant naming to the visit_count values
  mutate(visit_count = case_when(visit_count == "first" ~ 1,
                                 visit_count == "second" ~ 2,
                                 visit_count == "third" ~ 3)) %>%
  # ensure visit_cost is numeric
  mutate(visit_cost = as.numeric(visit_cost))


## # A tibble: 5 x 4
##       id visit_count visit_location visit_cost
##    <dbl>       <dbl> <chr>               <dbl>
## 1 100202           1 GH                      0
## 2 100396           1 Pvt. docto           1500
## 3 100396           2 Pvt. clini           1000
## 4 100396           3 Pvt. hospi           2500
## 5 100590           1 Pvt. docto           2000


Above, we first remove the entries where we do not have the visit location information
(i.e. we filter out the rows where the visit_location variable is set to ""). We then convert
the visit_count variable to numeric values, so that the strings "first" to "third" become
the numbers 1 to 3. Finally, we ensure the visit_cost variable is numeric using mutate()
and the helper function as.numeric().


PRACTICE (in RMD)

We will use survey data about diet from Vietnam. Women in Hanoi were
interviewed about their food shopping, and this was used to create
nutrition profiles for each woman. Here we will use a subset of this data
for 61 households who came for 2 visits, recording:

• enerc_kcal_w_1: the consumed energy from ingredient/food (Kcal) during the
  first visit (with _2 for the second visit)
• dry_w_1: the consumed dry from ingredient/food (g) during the
  first visit (with _2 for the second visit)
• water_w_1: the consumed water from ingredient/food (g) during
  the first visit (with _2 for the second visit)
• fat_w_1: the consumed lipid from ingredient/food (g) during the
  first visit (with _2 for the second visit)


diet_diversity_vietnam_wide <-
  read_csv(here("data/diet_diversity_vietnam_wide.csv"))

## Rows: 61 Columns: 9
## ── Column specification ──
## Delimiter: ","
## dbl (9): household_id, enerc_kcal_w_1, enerc_kcal_w_2, dry_w_1, dry_w_2, ...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

diet_diversity_vietnam_wide


## # A tibble: 5 x 9
##   household_id enerc_kcal_w_1 enerc_kcal_w_2 dry_w_1 dry_w_2
##          <dbl>          <dbl>          <dbl>   <dbl>   <dbl>
## 1          348          2268.          1386.    548.    281.
## 2          354          2775.          1240.    600.    284.
## 3           53          3104.          2075.    646.    451.
## 4           18          2802.          2146.    620.    807.
## 5          211          1298.          1191.    269.    288.
## # ... with 4 more variables: water_w_1 <dbl>,
## #   water_w_2 <dbl>, fat_w_1 <dbl>, fat_w_2 <dbl>


You should first determine whether we have a neat separator or not. Based on
this, rename your columns if necessary. Then bring the different visit
records (1 and 2) into a single column each for energy, fat weight, water weight
and dry weight. In other words, pivot the dataset into long format.

# Q_diet_diversity_vietnam_long <-
#   diet_diversity_vietnam_wide %>%
#   pivot_longer( )


Long to wide 


We just saw how to do some complex wide-to-long operations, which, as we saw in the
previous lesson, are essential for plotting and wrangling. Let's see the opposite
transformation.

It can be useful to pivot from long to wide in order to apply different transformations
or filters, or to process NAs. In this format, your measurements / collected data become
the columns of the data set.

Let's take the Zambia enteropathogen data, and this time, let's take the original! Indeed,
what you were handling before was a dataset prepared for you, in a wide format. The
original dataset is long and we will now see the data preparation I did beforehand,
behind the scenes. You're almost becoming the teacher of this lesson ;)


enteropathogens_zambia_long <-
  read_csv(here("data/enteropathogens_zambia_long.csv"))

Rows: 417 Columns: 5
── Column specification ──
Delimiter: ","
dbl (5): ID, group, LPS, LBP, IFABP

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

enteropathogens_zambia_long


# A tibble: 5 x 5
     ID group   LPS    LBP  IFABP
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1  1002     1  222. 38414. 1294.
2  1002     2  390.  6840.  610.
3  1003     1  181. 26888.   22.5
4  1004     2  221.  5426.    0
5  1004     1  257. 49183.    0


This is how we convert it from long to wide: 


enteropathogens_zambia_wide <-
  enteropathogens_zambia_long %>%
  pivot_wider(
    names_from = group,
    values_from = c(LPS, LBP, IFABP)
  )

enteropathogens_zambia_wide


# A tibble: 5 x 7
     ID LPS_1 LPS_2  LBP_1 LBP_2 IFABP_1 IFABP_2
  <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>
1  1002  222.  390. 38414. 6840.  1294.     610.
2  1003  181.   NA  26888.   NA     22.5     NA
3  1004  257.  221. 49183. 5426.     0        0
4  1005   NA   369.    NA  1938.     0     1010.
5  1006  275.   NA  61758.   NA      0       NA


You can see that the values of the variable group (1 or 2) are added to the values' names
(LPS, LBP, IFABP) to create the new columns representing different group data: for
example, LPS_1 and LPS_2.


We are considering this "advanced" pivoting because we are pivoting wider several
variables at the same time, but as you can see, the syntax is quite simple. The same
arguments are used as we did with the simpler pivots in the previous lesson: names_from
and values_from.
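As a sketch of how the new column names are assembled, and of how the two pivots mirror each other, the example below widens a small toy table and then pivots it back. The zambia_long_toy values are rounded stand-ins based on the rows printed above, not the real file, and the object names are made up for illustration.

```r
library(tidyr)   # pivot_wider(), pivot_longer(), and the re-exported %>% pipe
library(tibble)  # tribble()

# Toy stand-in for the long Zambia data (rounded, illustrative values)
zambia_long_toy <- tribble(
  ~ID,  ~group, ~LPS, ~LBP,  ~IFABP,
  1002, 1,      222,  38414, 1294,
  1002, 2,      390,  6840,  610,
  1003, 1,      181,  26888, 22.5
)

# Long to wide: new names are "<value column>_<group value>"
zambia_wide_toy <- zambia_long_toy %>%
  pivot_wider(names_from = group,
              values_from = c(LPS, LBP, IFABP))

names(zambia_wide_toy)

# Wide back to long: ".value" comes first because the value type
# sits before the separator in these names
zambia_back_to_long <- zambia_wide_toy %>%
  pivot_longer(!ID,
               names_sep = "_",
               names_to = c(".value", "group"),
               names_transform = list(group = as.numeric),  # "1" -> 1
               values_drop_na = TRUE)  # drop the all-NA row for ID 1003, group 2

all.equal(zambia_long_toy, zambia_back_to_long)
```

The round trip only closes cleanly here because the all-NA row created by widening is dropped again and group is converted back to a number; both choices are specific to this sketch.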


Let's see another example, using the diet survey data from Vietnam that you manipulated
previously:


diet_diversity_vietnam_long <-
  read_csv(here("data/diet_diversity_vietnam_long.csv"))

Rows: 122 Columns: 6
── Column specification ──
Delimiter: ","
dbl (6): visit_number, household_id, enerc_kcal_w, dry_w, water_w, fat_w

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

diet_diversity_vietnam_long

## # A tibble: 5 x 6
##   visit_number household_id enerc_kcal_w dry_w water_w fat_w
##          <dbl>        <dbl>        <dbl> <dbl>   <dbl> <dbl>
## 1            1          348        2268.  548.   4219.  78.4
## 2            1          354        2775.  600.   2310. 115.
## 3            1           53        3104.  646.   2808. 127.
## 4            1           18        2802.  620.   3457.  87.4
## 5            1          211        1298.  269.   2584.  47.8


Here we will use the visit_number variable to create new variables for the energy, water,
fat and dry content of foods recorded at different visits:


diet_diversity_vietnam_wide <-
  diet_diversity_vietnam_long %>%
  pivot_wider(
    names_from = visit_number,
    values_from = c(enerc_kcal_w, dry_w, water_w, fat_w)
  )

diet_diversity_vietnam_wide


# A tibble: 5 x 9
  household_id enerc_kcal_w_1 enerc_kcal_w_2 dry_w_1 dry_w_2
         <dbl>          <dbl>          <dbl>   <dbl>   <dbl>
1          348          2268.          1386.    548.    281.
2          354          2775.          1240.    600.    284.
3           53          3104.          2075.    646.    451.
4           18          2802.          2146.    620.    807.
5          211          1298.          1191.    269.    288.
# ... with 4 more variables: water_w_1 <dbl>,
#   water_w_2 <dbl>, fat_w_1 <dbl>, fat_w_2 <dbl>


You can see that the values of the variable visit_number (1 or 2) are added to the values'
names (enerc_kcal_w, dry_w, fat_w, water_w) to create the new columns
representing the different visits: for example, water_w_1 and water_w_2. We have
pivoted all of these variables to wide format at the same time. Now each weight measure
per visit is represented as a single variable (i.e. column) in the dataset.


With this format, it is easy to sum together the energy intake per household for example: 


diet_diversity_vietnam_wide %>%
  select(household_id, enerc_kcal_w_1, enerc_kcal_w_2) %>%
  mutate(total_energy_kcal = enerc_kcal_w_1 + enerc_kcal_w_2) %>%
  arrange(household_id)


# A tibble: 5 x 4
  household_id enerc_kcal_w_1 enerc_kcal_w_2
         <dbl>          <dbl>          <dbl>
1           14          1040.          1663.
2           17          2100.          1286.
3           18          2802.          2146.
4           22          3187.          1582.
5           24          2359.          2026.
# ... with 1 more variable: total_energy_kcal <dbl>


However, you could get something similar in the long format:

diet_diversity_vietnam_long %>%
  group_by(household_id) %>%
  summarize(total_energy = sum(enerc_kcal_w))


# A tibble: 5 x 2
  household_id total_energy
         <dbl>        <dbl>
1           14        2704.
2           17        3386.
3           18        4948.
4           22        4769.
5           24        4385.
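To check that the two routes give the same totals, here is a small sketch on toy data. The diet_long_toy values are rounded figures for two households taken from the tables above, and names_prefix is used only so that the widened columns get the same names as in the lesson.

```r
library(dplyr)   # group_by(), summarize(), mutate(), select()
library(tidyr)   # pivot_wider()
library(tibble)  # tribble()

# Toy stand-in for the long diet data (two households, rounded values)
diet_long_toy <- tribble(
  ~visit_number, ~household_id, ~enerc_kcal_w,
  1,             14,            1040,
  2,             14,            1663,
  1,             17,            2100,
  2,             17,            1286
)

# Route 1: stay long and sum within each household
totals_from_long <- diet_long_toy %>%
  group_by(household_id) %>%
  summarize(total_energy = sum(enerc_kcal_w))

# Route 2: pivot wider, then add the two visit columns
totals_from_wide <- diet_long_toy %>%
  pivot_wider(names_from = visit_number,
              values_from = enerc_kcal_w,
              names_prefix = "enerc_kcal_w_") %>%
  mutate(total_energy = enerc_kcal_w_1 + enerc_kcal_w_2) %>%
  select(household_id, total_energy)

all.equal(totals_from_long, totals_from_wide)
```

The long route has one practical advantage: it keeps working unchanged if a third visit is added, whereas the wide route would need another column in the sum.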


PRACTICE (in RMD)

Take the tb_visits_renamed dataset that we manipulated above and pivot
it back to its wide format.

# Q_tb_visits_wide <-
#   tb_visits_renamed %>%
#   pivot_wider( )


Wrap Up! 


Your data wrangling skills have just been enhanced with advanced pivoting. This skill will
often prove essential when handling real world data. I have no doubt you will soon put it
into practice. It is also essential, as we have seen, for plotting. So I hope pivoting will be of
use not only for your wrangling, but also for your plotting tasks.


Contributors 


The following team members contributed to this lesson: 


KENE DAVID NWOSU 


Data analyst, the GRAPH Network 


Passionate about world improvement 


LAURE VANCAUWENBERGHE 


Data analyst, the GRAPH Network 
A firm believer in science for good, striving to ally programming, health 
and education 


References 


Lesson notes | Intro to ggplot2 


Created by the GRAPH Courses team 


January 2023 






Introduction 


Welcome to The GRAPH Courses’ Data Visualization course! 


We will focus on learning how to use the {ggplot2} package to produce high quality 
visualizations in R. 



{ggplot2} is one of the core packages of the {tidyverse} metapackage. It is the most 
popular R package for data visualization. 


Let's dive in! 


Learning objectives 


By the end of this lesson you should be able to: 


1. Recall and explain how the {ggplot2} package for data visualization is based on a 
theoretical framework called the grammar of graphics. 


2. Name and describe the 3 essential components required for building a graph: data, 
aesthetics, and geometries. 


3. Write code to build a complete ggplot graphic by correctly supplying the 3 
essential layers to the ggplot() function. 


4. Create different types of plots such as scatter plots, line graphs, and bar graphs. 
5. Add or modify visual elements of a plot such as color and size. 


6. Distinguish between aesthetic mappings and fixed aesthetics, and know how to
apply them.


[Illustration by Allison Horst: "Build a data masterpiece"]


Packages 


The {tidyverse} meta package includes {ggplot2}, so we don’t need to add it separately. 
The {here} package will help us correctly reference file paths. 


# Load packages
pacman::p_load(tidyverse,
               here)


Measles outbreaks in Niger 


In this lesson, we will explore patterns of measles outbreaks in Niger. 
Measles is a highly infectious virus spread by airborne respiratory droplets. 


Since it is transmitted through direct contact, population density is an important driver 
of measles dynamics. 


The nigerm dataset 


We will be creating plots with a dataset of weekly reported measles cases at the region 
level in Niger. 


These data were collected by the Ministry of Health of Niger, from 1 Jan 1995 to 31 Dec 
2005. 


To get started, let’s first load the (preprocessed) data set: 


# Import data frame to RStudio Environment
load(here("data/clean/nigerm_cases_rgn.RData"))


Take a moment to browse through the data: 


# Print Niger measles (nigerm) data frame 
nigerm 


The nigerm data frame has 4 variables (or columns): 


1. year: Calendar year (ranges from 1995 to 2005)

2. week: Week of the year (ranges from 1 to 52)

3. region: Region in which the cases were recorded (see figure below)

4. cases: Number of measles cases reported


[Figure: Administrative divisions of Niger: Districts and Regions]


Several papers have investigated these trends, linking measles to human activity, 
migration, and seasonality. 


[Figure: Research articles that have used this dataset, and analyzed it in R!

• Alexandre Blake, Ali Djibo, Ousmane Guindo and Nita Bharti. "Investigating persistent
  measles dynamics in Niger and associations with rainfall". Journal of the Royal Society
  Interface. Published: 26 August 2020. https://doi.org/10.1098/rsif.2020.0480

• Matthew J. Ferrari, Ali Djibo, Rebecca F. Grais, Nita Bharti, Bryan T. Grenfell and
  Ottar N. Bjornstad. "Rural-urban gradient in seasonal forcing of measles transmission
  in Niger". Proceedings of the Royal Society B: Biological Sciences. Published: 28 April
  2010. https://doi.org/10.1098/rspb.2010.0536]


These studies are much more complex than what we will do here, but let's see if we can
find any patterns even with basic exploratory data visualization.


We can get some information about patterns in this data by inspecting summary
statistics given by the summary() function:

summary(nigerm)

      year           week           region         cases
 Min.   :1995   Min.   : 1.00   Agadez : 572   Min.   :   0.0
 1st Qu.:1997   1st Qu.:13.75   Diffa  : 572   1st Qu.:   1.0
 Median :2000   Median :26.50   Dosso  : 572   Median :  16.0
 Mean   :2000   Mean   :26.50   Maradi : 572   Mean   : 100.3
 3rd Qu.:2003   3rd Qu.:39.25   Niamey : 572   3rd Qu.:  86.0
 Max.   :2005   Max.   :52.00   Tahoua : 572   Max.   :1887.0
                                (Other):1144


This gives us values for the maximum, minimum, and quartiles of each numeric variable,
and the number of observations (rows) for each region. This summary is useful, but it
omits a large amount of information contained in the dataset.


Keep in mind that summary statistics can be highly misleading, and a simple plot can 
reveal a lot more. 
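A classic illustration of this point, not part of the Niger data, is Anscombe's quartet, which ships with R as the built-in anscombe data frame. The sketch below compares the means of its four x and four y columns and then, in an interactive session, draws one of the pairs with base R graphics.

```r
# anscombe is a built-in dataset (datasets package); no extra packages needed
x_means <- sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)
y_means <- sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)

# The four datasets have (almost) identical means
print(round(x_means, 2))
print(round(y_means, 2))

# A quick plot shows that one of them is actually a smooth curve
if (interactive()) {
  plot(anscombe$x2, anscombe$y2,
       xlab = "x2", ylab = "y2",
       main = "Anscombe's quartet, dataset 2")
}
```

The summary numbers alone give no hint of the very different shapes, which is the motivation for plotting the measles data rather than relying on summary().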


The easiest and clearest way to analyze patterns from this dataset is to visualize it! 


The best way to do this in R is with {ggplot2}. So let's see how that works. 


The layered Grammar of Graphics 


The gg in ggplot is short for “grammar of graphics”, which is the data visualization 
philosophy that {ggplot2} is based on. 


The grammar of graphics is a theoretical framework which deconstructs the process of 
producing a graph. 


Think of how we construct and form sentences in written and spoken languages by 
combining different elements, like nouns, verbs, articles, subjects, objects, etc. We can't 
just combine these elements in any arbitrary order; we must do so following a set of rules 
known as a linguistic grammar. 


Similarly, the grammar of graphics (GG) defines a set of rules for constructing graphics
by combining different types of elements, known as layers.


The Grammar of Graphics layers have specific names that you will see throughout the 
course. 


The three layers at the bottom of this figure - data, aesthetics, and geometries - are 
required for building any plot. 


Let’s define what they mean: 


1. data: the dataset containing the variables of interest.

2. aesthetics: things we can see that visually communicate information in our data.

3. geometry: the geometric shape used to represent data in a plot: points, lines, bars,
etc.


You might be wondering why we wrote data, geom, and aes in a computer code type 
font. You'll see very shortly that we use these terms in R code to represent GG layers. 


CHALLENGE

The terms and syntax used for ggplot functions, arguments, and layers
can be hard to keep up with at first, but as you gain experience using
these terms to make plots in R, you will become fluent in no time.


Working through the essential layers 


In this section, we will work towards a first plot with {ggplot2}. It will be a scatter plot 
using data from nigerm. 


For easier plotting in this lesson, we will use smaller subsets of the nigerm data frame,
one at a time.


First let’s create one called nigerm96, which only contains measles case data for the year 
1996. Running the code below will create nigerm96 and add it to your RStudio 
Environment: 


# Create nigerm96 data frame
nigerm96 <- nigerm %>%
  filter(year == 1996) %>%   # filter to only include rows from 1996
  select(-year)              # remove the year column

REMINDER

The select() and filter() functions are part of the {dplyr} package
for data manipulation, which is a core package of the {tidyverse}. These
topics are covered in the Data Wrangling course. See The GRAPH Courses
website for more.


Let's look at our new dataframe, nigerm96: 


# Print nigerm96 
nigerm96 


Building a ggplot() in steps 


Time to start building a ggplot in increments! We'll do this by starting with a blank 
canvas and then adding one layer at a time. 


Step 0: Call the ggplot() function 


# Call the `ggplot()` function
ggplot()


As you can see, this gives us nothing but a blank canvas. But not to worry, we're about to 
add some more elements. 


Step 1: Provide data 


The first input we need to supply to the ggplot() function is the data layer (i.e., a data
frame), by filling in the data argument (data = DF_NAME):

# Data layer
ggplot(data = nigerm96)    # what data to use

This gives us a blank plot again, since we've only supplied one out of the three inputs
required for a complete graphic. Next we need to assign variables to aesthetic mappings.


Step 2: Define the variables 


What should we plot on our axes? Let’s say we want to make an epidemic time series plot. 
To do that, we plot time (in weeks) on the x-axis, and disease incidence (number of 
reported cases) on the y-axis. In ggplot-speak, we are mapping the variable cases to the 
x aesthetic, and week to the y aesthetic. 


Let's tell ggplot() which variables to plot on the aesthetics layer with a mapping
argument, using this syntax: mapping = aes(x = VAR1, y = VAR2).


# Aesthetics layer: x and y position
ggplot(data = nigerm96,    # what data to use
       mapping = aes(      # supply a mapping in the form of an 'aesthetic'
         x = week,         # which variable to map onto the x-axis
         y = cases))       # which variable to map onto the y-axis


[Figure: blank plot with axes only. x-axis: week (0 to 50); y-axis: cases. No data points are drawn yet.]


There's still no data plotted, but the axis scales, titles, and labels are present. The x-axis
marks weeks of the year from 1 to 52, and the y-axis shows that the number of weekly
reported cases per region ranges from 0 to around 2000.


The plot is still lacking the required geometry layer. 


KEYPOINT aes() stands for aesthetics - things we can see. Variables are always
inside the aes() function, which in turn is inside ggplot(). Take a
moment to observe the double closing brackets )) - the first one belongs
to aes(), the second one to ggplot().
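To make the idea of "axes but no geometry" concrete, here is a minimal sketch on made-up data. The data frame toy and the object name p_axes are hypothetical; the point is only that a plot built from data and aes() alone has no geometry layers yet.

```r
library(ggplot2)

# Made-up values, used only for illustration
toy <- data.frame(week = c(1, 2, 3), cases = c(5, 20, 8))

# Data and aesthetics only: note the double closing brackets ))
p_axes <- ggplot(data = toy,
                 mapping = aes(x = week, y = cases))

# Number of geometry layers added so far
length(p_axes$layers)

# Drawing it shows axes and labels, but no points
p_axes
```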


Step 3: Specify which type of plot to create 


Finally, we add a geometry layer using a geom_* function. This determines which 
geometric objects - or visual markers - should be used to map the data. 


Since we are looking at the relationship of two numerical variables, it makes sense to use
a scatter plot. The geometric objects used to represent data on scatter plots are points,
and the geom_* function for scatter plots is conveniently named geom_point(). We'll
add this function as a new layer using a + sign:




# Geometries layer: points
ggplot(data = nigerm96,    # what data to use
       mapping = aes(      # define mapping
         x = week,         # which variable to map onto the x-axis
         y = cases)) +     # which variable to map onto the y-axis
  geom_point()             # add a geom of type point (for scatter plot)


[Figure: scatter plot of cases (y-axis) against week (x-axis, 0 to 50) for nigerm96.]


Points have been added, and this is now a complete scatter plot! There are 8 points per 
week, representing each of the 8 regions (but at this point we cannot tell which point is 
from which region). 
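The three layers can be sketched end to end on a tiny made-up data frame. Everything here (the name toy, the two regions, the case counts) is hypothetical; with two regions there are two points per week rather than eight.

```r
library(ggplot2)

# Made-up weekly counts for two regions
toy <- data.frame(
  week   = rep(1:4, times = 2),
  region = rep(c("Agadez", "Zinder"), each = 4),
  cases  = c(2, 9, 5, 1, 10, 40, 25, 6)
)

p_scatter <- ggplot(data = toy,                          # data layer
                    mapping = aes(x = week, y = cases)) + # aesthetics layer
  geom_point()                                           # geometry layer

p_scatter

# Points per week: one per region in this toy data
table(toy$week)
```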


REMINDER The aesthetic function is nested inside the ggplot() function, so be sure
to close the brackets for both functions before adding the + sign for the
geom_* function, or your code will not run correctly.


It's your turn to practice plotting with ggplot()! For practice exercises in this lesson, you
will be using a different subset of nigerm called nigerm04, which contains only data from
the year 2004.


Plotting with a different set of data will also allow you to explore whether the patterns we
see for 1996 are also true for 2004.


PRACTICE (in RMD)
Using the nigerm04 data frame, write ggplot code that will create a
scatter plot displaying the relationship between cases on the y-axis and
week on the x-axis.


Modifying the layers 


Generally speaking, the grammar of graphics allows for a high degree of customization of 
plots and also a consistent framework for easily updating and modifying them. 


We can tinker with our existing code to switch up the data, aesthetics, and geometry
inputs supplied to ggplot(), and create variations of the original plot. In fact, you've
already done this by changing the dataset from nigerm96 to nigerm04 in the practice
question.


Similarly, the aesthetics and geometry inputs can also be changed to create different
visualizations. In the next few sections we will take the scatter plot we built in the
previous section, and make incremental changes to modify different elements of the
original code.


Changing aesthetic mappings 
We created a scatter plot of cases vs week for nigerm96 with this code: 


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases)) +
  geom_point()


[Figure: scatter plot of cases (y-axis) against week (x-axis, 0 to 50) for nigerm96.]


If we copy the same code and change just one thing - by replacing the x variable week 
(numerical) with region (categorical) - we get what's called a strip plot: 


ggplot(data = nigerm96,
       mapping = aes(x = region,  # change which variable to map on the x-axis
                     y = cases)) +
  geom_point()


[Figure: strip plot of cases (y-axis) by region (x-axis: Agadez, Diffa, Dosso, Maradi, Niamey, Tahoua, Tillaberi, Zinder) for nigerm96.]


While the y-axis values of the points are the same as before, their x-axis mappings have 
changed significantly. They are now mapped to 8 separate positions along the x-axis, each 
corresponding to a discrete category of the region variable. 
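A small sketch of the same swap on made-up data is shown below. The data frame toy and its values are hypothetical; layer_data() is used only to peek at where the points end up on the x-axis.

```r
library(ggplot2)

# Made-up weekly counts for two regions
toy <- data.frame(
  week   = rep(1:4, times = 2),
  region = rep(c("Agadez", "Zinder"), each = 4),
  cases  = c(2, 9, 5, 1, 10, 40, 25, 6)
)

# Same code as a scatter plot, but x is now a categorical variable
p_strip <- ggplot(data = toy,
                  mapping = aes(x = region, y = cases)) +
  geom_point()

p_strip

# Distinct x positions used by the points: one per region
length(unique(layer_data(p_strip)$x))
```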


Changing geom_* functions


Similarly, we can modify the geometry layer to create a different type of plot, while still 
using the same aesthetic mappings. 


[Figure: examples of geoms grouped by use. Basic and one variable: blank, curve, path, polygon, rect, ribbon, line, area, density, dotplot, freqpoly, histogram, bar. Two variables: jitter, label, point, quantile, smooth, text, rug, hex, density2d, bin2d, violin, boxplot, dotplot. Error: crossbar, errorbar, linerange, pointrange. Three variables: contour, raster, tile. Map: map.]


{ggplot2} has a variety of different geom_* functions and geometric objects which you
can use to visualize your data. Here are some examples of different types of geoms that
can be used with ggplot().


Let's copy and paste the original scatter plot code once again, but this time we will
replace the geom_* function instead of the x aesthetic. If we change geom_point() to
geom_col(), we get a bar plot (sometimes called a column chart):


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases)) +
  geom_col()  # declare that we want a bar plot


[Figure: bar plot of cases (y-axis, 0 to 3000) against week (x-axis, 0 to 50) for nigerm96.]


Again, the rest of the code is still the same - we just changed the key word of the geom_*
function. However, the plot is significantly different from either the scatter plot or the
strip plot.


Notice that the y-axis has been rescaled. The height of each bar represents the
cumulative number of weekly cases, i.e., the total number of cases reported from all eight
regions that week, rather than showing 8 separate data points for each region.
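The stacking behaviour can be checked on made-up data. In this sketch, toy and its values are hypothetical, and aggregate() from base R is used only to compute the weekly totals that the bar heights should match.

```r
library(ggplot2)

# Made-up weekly counts for two regions
toy <- data.frame(
  week   = rep(1:4, times = 2),
  region = rep(c("Agadez", "Zinder"), each = 4),
  cases  = c(2, 9, 5, 1, 10, 40, 25, 6)
)

# Weekly totals across regions, computed directly from the data
totals <- aggregate(cases ~ week, data = toy, FUN = sum)
totals

p_col <- ggplot(data = toy,
                mapping = aes(x = week, y = cases)) +
  geom_col()

p_col

# Height of the tallest stacked bar, as computed by ggplot2
max(layer_data(p_col)$ymax)
```

The tallest bar should reach the largest weekly total, because geom_col() stacks the rows that share the same week.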


Error?
Not all plot types are interchangeable. Using a geom_* function that is
not compatible with the variables you defined in aes() will give you an
error. For example, let's replace geom_point() with geom_histogram()
instead:


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases)) +
  geom_histogram()


This is because a histogram shows the distribution of one numerical
variable. ggplot() can't map two variables to both the x- and y-axis.
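For contrast, here is a minimal sketch of a histogram that does work, because only one variable is mapped. The data frame toy and the choice of bins = 5 are assumptions made for illustration.

```r
library(ggplot2)

# Made-up case counts
toy <- data.frame(cases = c(2, 9, 5, 1, 10, 40, 25, 6))

# Only x is mapped: the histogram counts how many rows fall in each bin
p_hist <- ggplot(data = toy,
                 mapping = aes(x = cases)) +
  geom_histogram(bins = 5)

p_hist

# One row per bin, with the number of observations in each
layer_data(p_hist)$count
```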


PRACTICE (in RMD)
Use the nigerm04 data frame to create a bar plot of weekly cases with
the geom_col() function. Map cases on the y-axis and week on the x-axis.

Additional aesthetic mappings inside aes()


So far, we have only mapped variables to the x and y aesthetic attributes. We can also 
map variables to other aesthetics like color, size, or shape. 


[Figure: common aesthetic attributes used in ggplot graphics: position, shape, size, color, line width, and line type.]


Let’s return to our original scatter plot (cases vs week): 


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases)) +
  geom_point()


[Figure: scatter plot of cases (y-axis) against week (x-axis) for nigerm96.]


There are other aesthetics we can add, like color or size. 


[Figure: diagram of a geom drawn on a coordinate system, with variables mapped to the x, y, color, and size aesthetics.]


PRO TIP To see the full list of aesthetics that can be used with a particular geom_*
function, look up the function documentation. You can do this by
pressing F1 on a function, e.g., geom_point(), to open the Help tab, and
scrolling down to the "Aesthetics" section. If F1 is hard to summon on your
keyboard, type and run ?geom_point in your Console tab.


Let's add color to our scatter plot. We can map the categorical variable region to the
color aesthetic. We can do this by modifying the original code to add a new argument
inside mapping = aes(). Let's see what happens when we add color = region inside
aes():


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases,
                     color = region)) +  # use a different color for each region
  geom_point()


[Figure: scatter plot of cases (y-axis) against week (x-axis, 0 to 50) for nigerm96, with points colored by region. Legend: Agadez, Diffa, Dosso, Maradi, Niamey, Tahoua, Tillaberi, Zinder.]


Now we have a colorful scatter plot! Each point is colored according to the region it 
belongs to. This allows us to better distinguish between regions. 
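Here is a minimal sketch of the same mapping on made-up data. The data frame toy is hypothetical, and layer_data() is used only to inspect which colors ggplot2 assigned; the exact color codes depend on the default palette.

```r
library(ggplot2)

# Made-up weekly counts for two regions
toy <- data.frame(
  week   = rep(1:4, times = 2),
  region = rep(c("Agadez", "Zinder"), each = 4),
  cases  = c(2, 9, 5, 1, 10, 40, 25, 6)
)

p_color <- ggplot(data = toy,
                  mapping = aes(x = week,
                                y = cases,
                                color = region)) +  # one color per region
  geom_point()

p_color

# Colors actually assigned to the points: one distinct value per region
unique(layer_data(p_color)$colour)
```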




Note that ggplot() automatically provides a color legend on the right.


The colors are from {ggplot2}'s default rainbow color palette. In later 
lessons we will learn how to customize color scales and palettes, 
including making figures colorblind-friendly. 


By examining the color patterns in the plot, you can make out the classic bell-shaped 
epidemic curves showing a rise and fall in measles incidence in each region. 


Zinder had the largest number of cases and the steepest epidemic curve, followed by 
Maradi and Niamey. 


While the colorful plot provides more insight into measles patterns at the regional level 
than the scatter plot with no color mapping, this graph still looks busy and is not the 
most intuitive to read. A different plot type could help with this. 


Next we will try a bar plot, then a line graph. 


Let’s try the same color = region aesthetic mapping with geom_col() instead: 


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases,
                     color = region)) +  # use a different outline color for each region
  geom_col()


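[Figure: stacked bar plot of cases (y-axis, 0 to 3000) against week (x-axis) for nigerm96, with grey bars outlined in a different color for each region. Legend: Agadez, Diffa, Dosso, Maradi, Niamey, Tahoua, Tillaberi, Zinder.]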


This gives us a stacked bar plot, where the bars are divided into smaller sections. This 
shows us the proportional contribution of individual regions (i.e., the height or length of 
each subsection represents how much each region contributes to the total number of 
cases that week). 


The stacked bar plot here is outlined by color. This is because the color aesthetic in
{ggplot2} generally refers to the border around a shape. This did not apply to the default
shapes in our scatter plot created with geom_point() because they are solid dots (not
hollow), but you can see that it does apply to the bars in a bar chart created with
geom_col(). However, the grey filling is not very pretty.


We might want to color the inside of the bars instead. This is done by mapping our 
variable to the fill aesthetic. We can copy the code above and simply change color to 
fill inside aes(): 


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases,
                     fill = region)) +  # use a different fill color for each region
  geom_col()


[Figure: stacked bar plot of cases (y-axis, 0 to 3000) against week (x-axis, 0 to 50) for nigerm96, with bars filled by region. Legend: Agadez, Diffa, Dosso, Maradi, Niamey, Tahoua, Tillaberi, Zinder.]


Voila! The insides of the bars are now filled with color.
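The difference between color and fill on bars can be seen side by side in a small sketch. The data frame toy is made up, and the object names p_outline and p_fill are hypothetical.

```r
library(ggplot2)

# Made-up weekly counts for two regions
toy <- data.frame(
  week   = rep(1:4, times = 2),
  region = rep(c("Agadez", "Zinder"), each = 4),
  cases  = c(2, 9, 5, 1, 10, 40, 25, 6)
)

# color = region changes the outline of each bar segment
p_outline <- ggplot(toy, aes(x = week, y = cases, color = region)) +
  geom_col()

# fill = region changes the inside of each bar segment
p_fill <- ggplot(toy, aes(x = week, y = cases, fill = region)) +
  geom_col()

p_outline
p_fill

# Fill values used in each plot: a single default grey vs one per region
unique(layer_data(p_outline)$fill)
unique(layer_data(p_fill)$fill)
```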


Now practice using the color aesthetic mapping with a new plot type: line graphs. Line 
graphs are generally considered one of the best plot types for time series data. 


PRACTICE (in RMD)
Use the nigerm04 data frame to create a line graph of weekly cases,
colored by region. Map cases on the y-axis, week on the x-axis, and
region to color. The geom_* function for a line graph is called
geom_line().


Fixed aesthetics outside aes()


It is very important to understand the difference between aesthetic mappings and fixed
aesthetics. The main aesthetics in ggplot are x, y, color, fill, and size, and any of
these could be either a mapping or a fixed value. This depends on whether they appear
inside or outside the aes() function.


When we apply an aesthetic to modify the geometric objects according to a variable (e.g.,
the color of points changes according to the region variable), that's an aesthetic
mapping. This must always be defined inside mapping = aes(), like we just did in
previous examples.


But if you want to apply a visual modification to all the geometric objects evenly (e.g.,
manually change the color of all points to be one color), that's a fixed aesthetic. We must
set fixed aesthetics to a constant value outside mapping = aes() and directly inside the
geom_* function - e.g., geom_point(color = "COLOR_NAME").


Here let’s change the color of all the points in our scatter plot to blue: 


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases)) +
  geom_point(color = "blue")  # use the same color for all points
[Figure: scatter plot of cases (y-axis) against week (x-axis, 0 to 50) for nigerm96, with all points drawn in blue.]


This colors each point with the same R color ("blue"). In this plot, the color aesthetic does
not represent any values from the data frame. Note that color names in R are
character strings, so they need to go inside quotation marks.
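A minimal sketch of a fixed color on made-up data follows. The data frame toy is hypothetical; colors() is the base R function that lists the built-in color names.

```r
library(ggplot2)

# Made-up values
toy <- data.frame(week = 1:4, cases = c(2, 9, 5, 1))

# Fixed aesthetic: color is set inside geom_point(), outside aes()
p_blue <- ggplot(data = toy,
                 mapping = aes(x = week, y = cases)) +
  geom_point(color = "blue")

p_blue

# Every point gets the same color value
unique(layer_data(p_blue)$colour)

# Check that a name is one of R's built-in color names
"blue" %in% colors()
```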


SIDE NOTE If you're curious, run colors() in your console to see all possible choices
of colors in R! To find out exactly how many options that is, try running
colors() %>% length().
Now let's add a fixed aesthetic called size. The default line width used by geom_line() is 
0.5 mm, which looks like this: 


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases,
                     color = region)) +
  geom_line()


[Figure: line graph of cases (y-axis) against week (x-axis) for nigerm96, one thin line per region. Legend: Agadez, Diffa, Dosso, Maradi, Niamey, Tahoua, Tillaberi, Zinder.]


To make all of the lines in our figure a little thicker, let's fix this aesthetic at 1 mm. We do
this by adding size = 1 inside the geom_line() function:


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases,
                     color = region)) +
  geom_line(size = 1)


[Figure: line graph of cases (y-axis) against week (x-axis, 0 to 50) for nigerm96, one thicker line per region. Legend: Agadez, Diffa, Dosso, Maradi, Niamey, Tahoua, Tillaberi, Zinder.]


All the lines in the plot have been made thicker, and the line width is set to a constant 
value of 1 mm. Note that here the value of size is numeric, so it should not be in quotation 
marks. 
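Here is a minimal sketch of a fixed line width on made-up data. It assumes ggplot2 3.4.0 or later, where the line-width setting is named linewidth; on those versions size = 1 inside geom_line() still draws thicker lines but prints a deprecation warning. The data frame toy is hypothetical.

```r
library(ggplot2)

# Made-up weekly counts for two regions
toy <- data.frame(
  week   = rep(1:4, times = 2),
  region = rep(c("Agadez", "Zinder"), each = 4),
  cases  = c(2, 9, 5, 1, 10, 40, 25, 6)
)

# color is mapped to a variable (inside aes);
# the line width is fixed to one value (inside the geom, outside aes)
p_thick <- ggplot(data = toy,
                  mapping = aes(x = week,
                                y = cases,
                                color = region)) +
  geom_line(linewidth = 1)  # assumption: ggplot2 >= 3.4.0

p_thick

# The fixed width applied to every line segment
unique(layer_data(p_thick)$linewidth)
```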


WATCH OUT Remember that fixed aesthetics are manually set to a constant value (as
opposed to a variable from the data), and go directly in the geom_*
function, not inside aes(). If you try to put a fixed aesthetic in aes(),
you might get a weird result. For example, let's try moving the size = 1
aesthetic from geom_line() to aes() to see how it can go wrong:


ggplot(data = nigerm96,
       mapping = aes(x = week,
                     y = cases,
                     color = region,
                     size = 1)) +  # INCORRECT placement
  geom_line()


[Figure: line graph of cases (y-axis) against week (x-axis) for nigerm96, colored by region, with an unwanted extra legend titled "size" above the region legend.]


aes() is a mapping function that modifies plots based on variables from
the data. Since there is no variable called "1" in the nigerm96 data frame,
aes() cannot map this aesthetic as intended: it treats 1 as a constant data
value and adds a size legend instead of simply fixing the line width.
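The same pitfall is easy to see with color. In this sketch on made-up data (toy, p_fixed, and p_mapped are hypothetical names), the word "blue" only produces blue points when it is placed outside aes().

```r
library(ggplot2)

# Made-up values
toy <- data.frame(week = 1:4, cases = c(2, 9, 5, 1))

# Fixed aesthetic: "blue" is passed straight to the geom
p_fixed <- ggplot(toy, aes(x = week, y = cases)) +
  geom_point(color = "blue")

# Misplaced: inside aes(), "blue" is treated as a data value and is sent
# through the default color scale, which also adds a legend
p_mapped <- ggplot(toy, aes(x = week, y = cases, color = "blue")) +
  geom_point()

p_fixed
p_mapped

# Compare the colors that are actually drawn
unique(layer_data(p_fixed)$colour)
unique(layer_data(p_mapped)$colour)
```

In the second plot the points take the first color of the default palette rather than blue, and a legend with the single entry "blue" appears.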


Practice using fill as a fixed aesthetic for a bar plot.


PRACTICE (in RMD)
Use the nigerm04 data frame to create a bar graph of weekly cases, and
fill all bars with the same color. Map cases on the y-axis, week on the x-axis,
and fix the fill aesthetic of the bars to the R color "hotpink".


Additional GG layers 


In this lesson, we kept things simple and only worked with the three required layers. As
you start to delve deeper into plotting with {ggplot2}, you'll start to encounter the other
layers more frequently.


Soon you'll be able to create more complex plots, like this one: 


[Figure: "Seasonal patterns of measles incidence in Niger", subtitle "Weekly reported at region level (1995-2005)". One panel per year; x-axis: Week of the year (0 to 50); y-axis: Number of cases reported; one line per region. Source: doi:10.5061/dryad.1jwstqjrd]


RECAP
To build a complete ggplot, you must first supply a data frame using the
data argument of ggplot(), and define variables and map them to
aesthetics inside aes() using the mapping argument of ggplot(). Then
start a new layer with a + sign and specify the type of plot you want using
an appropriate geom_* function. You can copy this code template and
adapt it to create different ggplot graphics:


ggplot(data = DF_NAME,
       mapping = aes(AES1 = VAR1,
                     AES2 = VAR2,
                     AES3 = VAR3,
                     ...)) +
  geom_FUNCTION()


Learning outcomes 


1. You can recall and explain how the {ggplot2} package for data visualization is based
on a theoretical framework called the grammar of graphics.

2. You can name and describe the 3 essential layers for building a graph: data,
aesthetics, and geometries.

3. You can write code to build a complete ggplot graphic by correctly supplying the
3 essential layers to the ggplot() function.

4. You can create different types of plots such as scatter plots, line graphs, and bar
graphs.

5. You can add or modify aesthetics of a plot such as color and size.


Contributors 


The following team members contributed to this lesson: 


JOY VAZ 


R Developer and Instructor, the GRAPH Network 
Loves doing science and teaching science 


References 


Some material in this lesson was adapted from the following sources: 




Blake, Alexandre, Ali Djibo, Ousmane Guindo, and Nita Bharti. 2020. “Investigating 
Persistent Measles Dynamics in Niger and Associations with Rainfall.” Journal of The 
Royal Society Interface 17 (169): 20200480. https://doi.org/10.1098/rsif.2020.0480. 


Cmprince. Administrative divisions of Niger: Departments and Regions. 29 October
2017. Wikimedia Commons. Accessed October 14, 2022.
https://commons.wikimedia.org/wiki/File:Niger_administrative_divisions.svg


DeBruine, Lisa, and Dale Barr. 2022. Chapter 3 Data Visualisation | Data Skills for 
Reproducible Research. https://psyteachr.github.io/reprores-v3/ggplot.html. 


Franke, Michael. n.d. 6 Data Visualization | An Introduction to Data Analysis.
Accessed October 12, 2022.
https://michael-franke.github.io/intro-data-analysis/Chap-02-02-visualization.html.


Geography Now, dir. 2019. Geography Now! NIGER.
https://www.youtube.com/watch?v=AHeq99pojLo.


Giroux-Bougard, Xavier, Maxwell Farrell, Amanda Winegardner, Etienne Low-Decarie 
and Monica Granados. 2020. Workshop 3: Introduction to Data Visualisation with 
Ggplot2. http://r.qcbs.ca/workshop03/book-en/. 


Ismay, Chester, and Albert Y. Kim. 2022. A ModernDive into R and the Tidyverse. 
https://moderndive.com/. 


Kabacoff, Rob. 2020. Data Visualization with R. https://rkabacoff.github.io/datavis/.


DeBruine, Lisa. 2020. Basic Plots. https://www.youtube.com/watch?v=tOFQFPRgZ3M.


Harrison, Ewen, and Riinu Pius. n.d. R for Health Data Science. Accessed October 11,
2022. https://argoshare.is.ed.ac.uk/healthyr_book/.


Prabhakaran, Selva. 2016. “How to Make Any Plot in Ggplot2? | Ggplot2 Tutorial.” 
2016. http://r-statistics.co/ggplot2-Tutorial-With-R.html. 


This work is licensed under the Creative Commons Attribution Share Alike license. 




Lesson notes | Scatter plots and smoothing lines


Created by the GRAPH Courses team 


January 2023 




Introduction
Learning Objectives
Childhood diarrheal diseases in Mali
Scatter plots via geom_point()
Aesthetic modifications
Mapping data to aesthetics
Setting fixed aesthetics
Adding a trend line
Summary


Introduction 


Scatter plots - which are sometimes called bivariate plots - allow you to visualize the 
relationship between two numerical variables. 


They are among the most commonly used plots because they can provide an immediate 
way to see how one numerical variable varies against another. 


Scatter plots can also display multiple relationships by mapping additional variables to
aesthetic properties, such as the color of the points.


Trends and relationships in a scatter plot can be made clearer by adding a smoothing line 
over the points. 


We will use ggplot to do all that and more. Let’s get started! 


Learning Objectives 


1. You can visualize relationships between numerical variables using scatter plots with 
geom_point(). 

2. You can use color as an aesthetic argument to map variables from the dataset 
onto individual points. 

3. You can change the size, shape, color, fill, and opacity of geometric objects by 
setting fixed aesthetics. 

4. You can add a trend line to a scatter plot with geom_smooth().


Childhood diarrheal diseases in Mali 


We will be using data collected for a prospective observational study of acute diarrhea in
children aged 0-59 months. The study was conducted in Mali in early 2020.


The full dataset can be obtained from Dryad, and the paper can be viewed here. 


VOCAB A prospective study watches for outcomes, such as the development of a
disease, during the study period and relates this to other factors such as
suspected risk or protection factors.


Spend some time browsing through this dataset. Each row corresponds to one patient 
surveyed. There are demographic, physiological, clinical, socioeconomic, and geographic 
variables. 


We will begin by visualizing the relationship between the following two numerical 
variables: 


1. age_months: the patient's age in months on the horizontal x-axis and 
2. viral_load: the patient's viral load on the vertical y-axis 


Scatter plots via geom_point()


We will explore relationships between some numerical variables in the malidd data 
frame. 


We will now examine and run the code that will create the desired scatter plot, while
keeping in mind the GG framework. Let's take a look at the code and break it down
piece by piece.


Remember that we specify the first two GG layers as arguments (i.e., inputs) within the 
ggplot() function: 


1. We provide the malidd data frame with the data argument, by inputting data = 
malidd. 

2. We define the variables to be plotted in the aesthetics function of the mapping 
argument, by inputting mapping = aes(x = age_months, y = viral_load). 
Specifically, the variable age_months is mapped to the x aesthetic, while the 
variable viral_load is mapped to the y aesthetic. 


We then add the geom_*() function on a new layer with a + sign. The geometric objects 
(i.e., shapes) needed for a scatter plot are points, so we add geom_point(). 


After running the following lines of code, you'll produce the scatter plot below: 


# Simple scatter plot of viral load vs age
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point()


[Figure: scatter plot of viral_load (y-axis, 0.2 to 0.8) against age_months (x-axis, 0 to 50) for malidd.]


This suggests that viral load generally decreases with age. 
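The same three-layer recipe can be sketched on a few made-up rows. The data frame toy_malidd and its values are hypothetical stand-ins that only borrow the column names age_months and viral_load from the lesson.

```r
library(ggplot2)

# Hypothetical stand-in for malidd (values are made up)
toy_malidd <- data.frame(
  age_months = c(3, 8, 14, 22, 30, 41, 55),
  viral_load = c(0.71, 0.64, 0.58, 0.45, 0.33, 0.29, 0.12)
)

p_viral <- ggplot(data = toy_malidd,                  # data layer
                  mapping = aes(x = age_months,       # aesthetics layer
                                y = viral_load)) +
  geom_point()                                        # geometry layer

p_viral
```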


PRACTICE (in RMD)
Using the malidd data frame, create a scatter plot showing the
relationship between age and height (height_cm).


Aesthetic modifications 


An aesthetic is a visual property of the geometric objects (geoms) in your plot. Aesthetics 
include things like the size, the shape, or the color of your points. You can display a point 
in different ways by changing the values of its aesthetic properties. 


Remember, there are two methods for changing the aesthetic properties of your geoms 
(in this case, points). 


1. You can convey information about your data by mapping the variables in your
dataset to aesthetics in your plot. For this method, you use aes() in the mapping
argument to associate the name of the aesthetic with a variable to display.


2. You can also set the aesthetic properties of your geoms manually. Here the aesthetic
doesn't convey information about a variable, but only changes the appearance of
the plot. To change an aesthetic manually, you set the aesthetic by name as an
argument of your geom_*() function; i.e., it goes outside of aes().


Mapping data to aesthetics 


In addition to mapping variables to the x and y axes like we did above, variables can be
mapped to the color, shape, size, opacity, and other visual characteristics of geoms. This
allows groups of observations to be superimposed in a single graph.


To map a variable to an aesthetic, associate the name of the aesthetic with the name of the
variable inside aes(). This way, we can add a third variable to our simple two-dimensional
scatter plot by mapping it to a new aesthetic.


For example, let's map height_cm to the colors of our points, to show us how height
varies with age and viral load:


ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(mapping = aes(color = height_cm))


[Figure: scatter plot of viral_load (y-axis) against age_months (x-axis, 0 to 50) for malidd, with points colored on a continuous scale by height_cm (legend from 60 to 100).]


We see that {ggplot2} has automatically assigned the values of our variable to an
aesthetic, a process known as scaling. {ggplot2} will also add a legend that explains which
levels correspond to which values.


Here the points are colored by different shades of the same blue hue, with darker colors 
representing lower values. 


This shows us that height increases with age, as expected. 
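Here is a minimal sketch of mapping a continuous variable to color, on made-up rows. The data frame toy_malidd is a hypothetical stand-in that borrows the lesson's column names; layer_data() is used only to show that each distinct height receives its own shade.

```r
library(ggplot2)

# Hypothetical stand-in for malidd (values are made up)
toy_malidd <- data.frame(
  age_months = c(3, 8, 14, 22, 30, 41, 55),
  viral_load = c(0.71, 0.64, 0.58, 0.45, 0.33, 0.29, 0.12),
  height_cm  = c(60, 68, 75, 83, 90, 97, 105)
)

p_height <- ggplot(data = toy_malidd,
                   mapping = aes(x = age_months,
                                 y = viral_load)) +
  geom_point(mapping = aes(color = height_cm))  # continuous variable -> color gradient

p_height

# Colors assigned by the scale: many shades rather than a few fixed colors
unique(layer_data(p_height)$colour)
```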


Instead of a continuous variable like height_cm, we can also map a binary variable like
breastfeeding, to show us which children are breastfed and which ones are not:


ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(mapping = aes(color = breastfeeding))


[Figure: scatter plot of viral_load (y-axis) against age_months (x-axis, 0 to 50) for malidd, with points colored on a continuous scale by breastfeeding (legend from 0.00 to 1.00).]


We get the same gradual color scaling as we did with height. This communicates a
continuum of values, rather than the two distinct values in our variable - 0 or 1.


This is because of the data class of the breastfeeding variable in malidd: 


class(malidd$breastfeeding)


## [1] "numeric" 


But even though binary variables are numerical, they represent two discrete possibilities. 
So the continuous color scaling in the plot above is not ideal. 


In cases like this, we add the function factor() around the breastfeeding variable to
tell ggplot() to treat the variable as a factor. Let's see what happens when we do that:


ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(mapping = aes(color = factor(breastfeeding)))


[Figure: scatter plot of viral_load (y-axis) against age_months (x-axis, 0 to 50) for malidd, with points in two distinct colors by factor(breastfeeding) (legend: 0, 1).]


When the variable is treated like a factor, the colors chosen are clearly distinguishable.
With factors, {ggplot2} will automatically assign a unique level of the aesthetic (here a
unique color) to each unique value of the variable. This is what happened with the
region variable of the nigerm data frame that we used in the last lesson.
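The effect of factor() can be sketched on made-up rows. The data frame toy_malidd is a hypothetical stand-in using the lesson's column names; the two plots differ only in whether breastfeeding is wrapped in factor().

```r
library(ggplot2)

# Hypothetical stand-in for malidd (values are made up)
toy_malidd <- data.frame(
  age_months    = c(3, 8, 14, 22, 30, 41),
  viral_load    = c(0.71, 0.64, 0.58, 0.45, 0.33, 0.29),
  breastfeeding = c(1, 1, 1, 0, 0, 0)
)

# Stored as a number, so ggplot2 would use a continuous color scale
class(toy_malidd$breastfeeding)

# Wrapped in factor(), it has two discrete levels
levels(factor(toy_malidd$breastfeeding))

p_numeric <- ggplot(toy_malidd, aes(x = age_months, y = viral_load)) +
  geom_point(aes(color = breastfeeding))          # gradient legend

p_factor <- ggplot(toy_malidd, aes(x = age_months, y = viral_load)) +
  geom_point(aes(color = factor(breastfeeding)))  # two-color legend

p_numeric
p_factor
```

Wrapping the variable inside aes() leaves the data frame itself unchanged; converting the column once beforehand would be an alternative design choice.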


This plot reveals a clear relationship between age and breastfeeding, as we might expect. 
Children are likely to stop breastfeeding around 20 months of age. In this study, no child 
at or above 25 months was being breastfed. 


Adding colors to the scatter plot allowed us to visualize a third variable in addition to the 
relationship between age and viral load. The third variable could be either discrete or 
continuous. 


PRACTICE (in RMD)
Using the malidd data frame, create a scatter plot showing the
relationship between age and viral load, and map a third variable,
freqrespi, to color.


PRACTICE (in RMD)
# Type and view your answer:
age_height_fever <- "YOUR ANSWER HERE"
age_height_fever


Setting fixed aesthetics 


Aesthetic arguments set to a fixed value will be static, and the visual effect is not data-
dependent. To add a fixed aesthetic, we add it as a direct argument of the geom_*()
function; i.e., it goes outside of mapping = aes().


Let's look at some of the aesthetic arguments we can place directly within geom_point()
to make visual changes to the points in our scatter plot:


• color - point color or point outline color
• size - point size
• alpha - point opacity
• shape - point shape
• fill - point fill color (only applies if the point has an outline)


To use these options to create a more attractive scatter plot, you'll need to pick a value 
for each argument that makes sense for that aesthetic, as shown in the examples below. 


Changing color, size and alpha 


Let's change the color of the points to a fixed value by setting the color argument
directly within geom_point(). The color we choose must be a character string that R
recognizes as a color. Here we will set the point colors to steel blue:


# Modify original scatter plot by setting `color = "steelblue"`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(color = "steelblue")  # set color


0.8- s 
(J 
e°e 
e "f er 
° © © ooo °o 9 
cor N Ola & Be 
e e e e o o 2 A 
0.6- "| es, 
J e e 
e ee°e e ° 
e 
e.. 38 
ne) e ° 
© oof Pe lee 
2 e e ° o e e 
E 04- e 8s coe, oO ° 
> e e o 8 e 
e s e 
e 
(J s e 
e e e 
0.2- ° - ° e 
a 
e e e 
e A e A 
bd e e 
e e A 
0 10 20 30 40 50 
age_months 


In addition to changing the default color, we will now modify the size aesthetic of the
points by assigning it a fixed number (in millimeters). The default size is 1.5 mm, so let's
choose a larger value:

# Set size to 2 mm by adding `size = 2`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(color = "steelblue", # set color
             size = 2)            # set size (mm)

[Figure: the same scatter plot of viral_load against age_months, with larger steel blue points]


The alpha aesthetic controls the level of opacity of geoms. alpha is also numerical, and
ranges from 0 (completely transparent) to the default of 1 (completely opaque). Let's
make our points more transparent by reducing the opacity:

# Set opacity to 75% by adding `alpha = 0.75`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(color = "steelblue", # set color
             size = 2,            # set size (mm)
             alpha = 0.75)        # set level of opacity

[Figure: the same scatter plot of viral_load against age_months, with semi-transparent steel blue points]


Now we can see where multiple points overlap. This is a useful parameter for scatter plots 
where there is overplotting. 


Remember, changing the color, size, or opacity of our points here is not conveying any 
information in the data - they are design choices we make to create prettier plots. 


PRACTICE (in RMD)

• Create a scatter plot with the same variables as the previous
example, but change the color of the points to cornflowerblue,
increase the size of points to 3 mm and set the opacity to 60%.


Changing shape and fill 


We can change the appearance of points in a scatter plot with the shape aesthetic. 


To change the shape of your geoms to a fixed value, set shape equal to a number 
corresponding to your desired shape. 


{ggplot2} will accept the following numbers:

[Figure: chart of the available point shapes, numbered 0 to 25; shapes 21-25 are shown filled in red]

Notice that some of the shapes are filled in with red. This indicates that objects 21-25 are
sensitive to both color and fill, but the others are only sensitive to color.
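As a minimal sketch of this distinction, the example below draws the same made-up points (the data frame toy is hypothetical) once with a solid shape and once with a fillable shape. Shape 19 is a solid circle that is drawn using color only, while shape 21 uses color for the outline and fill for the interior:

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(x = 1:4, y = c(2, 4, 3, 5))

# Shape 19 (solid circle): drawn in the color; the fill setting has no visible effect
p_solid <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(shape = 19, color = "cyan4", fill = "seagreen", size = 4)

# Shape 21 (fillable circle): cyan4 outline, seagreen interior
p_fillable <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(shape = 21, color = "cyan4", fill = "seagreen", size = 4)

# Inspect the settings stored for the fillable version
unique(layer_data(p_fillable)[, c("shape", "colour", "fill")])
```

Printing p_solid and p_fillable side by side shows the difference most clearly.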


First let's modify our original scatterplot by changing the shapes to something that can
be filled in:

# Set shape to fillable circles by adding `shape = 21`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(shape = 21) # set shapes to display

[Figure: scatter plot of viral_load against age_months, with hollow circles as points]


Fillable shapes can have different colors for the outline and interior. Changing the color 
aesthetic will only change the outline of our points: 


# Set outline color of the shapes by adding `color = "cyan4"`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(shape = 21,      # set shapes to display
             color = "cyan4") # set outline color

[Figure: scatter plot of viral_load against age_months, with hollow circles outlined in cyan4]

Now let's fill in the points: 

# Set interior color of the shapes by adding `fill = "seagreen"`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(shape = 21,        # set shapes to display
             color = "cyan4",   # set outline color
             fill = "seagreen") # set fill color

[Figure: scatter plot of viral_load against age_months, with circles outlined in cyan4 and filled in seagreen]


We can improve the readability by increasing size and reducing opacity with size and
alpha, like we did before:

ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(shape = 21,         # set shapes to display
             color = "cyan4",    # set outline color
             fill = "seagreen",  # set fill color
             size = 2,           # set size (mm)
             alpha = 0.75)       # set level of opacity

[Figure: the same scatter plot with larger, semi-transparent filled circles]


Adding a trend line 


It can be hard to view relationships or trends with just points alone. Often we want to add 
a smoothing line in order to see what the trends look like. This can be especially helpful 
when trying to understand regressions. 


To get a better idea of the relationship between these two variables, we can add a trend
line (also known as a best fit line or a smoothing line).

To do this, we add the function geom_smooth() to our scatter plot:

ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

[Figure: scatter plot of viral_load against age_months with a loess smoothing line and confidence band]


The smoothing line comes after our points as another geometric layer added onto our
plot.

The default smoothing function used in this scatter plot is "loess", which stands for
locally weighted scatter plot smoothing. Loess smoothing is a process used by many
statistical software packages. In {ggplot2}, loess should generally be used when you have
fewer than 1000 points; otherwise it can be time consuming.

Many other smoothing functions can also be used in geom_smooth().


Let’s request a linear regression method. This time we will use a generalized linear model 
by setting the method argument inside geom_smooth(): 


# Change to a linear smoothing function with `method = "glm"`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point() +
  geom_smooth(method = "glm")

## `geom_smooth()` using formula = 'y ~ x'

[Figure: scatter plot of viral_load against age_months with a straight trend line and confidence band]


By default, 95% confidence limits for these lines are displayed. 


You can suppress the confidence bands by including the argument se = FALSE inside
geom_smooth():

# Remove confidence interval bands by adding `se = FALSE`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point() +
  geom_smooth(method = "glm",
              se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

[Figure: scatter plot of viral_load against age_months with a straight trend line and no confidence band]


In addition to changing the method, let's add the color argument inside geom_smooth()
to change the color of the line.

# Change the color of the trend line by adding `color = "darkred"`
ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point() +
  geom_smooth(method = "glm",
              se = FALSE,
              color = "darkred")

## `geom_smooth()` using formula = 'y ~ x'

[Figure: scatter plot of viral_load against age_months with a dark red straight trend line]


This linear regression concurs with what we initially observed in the first scatter plot. A 


negative relationship exists between age_months and viral_load: as age increases, viral 
load tends to decrease. 
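To see what a straight trend line represents, the sketch below compares the line drawn by geom_smooth() with the fitted values of an ordinary linear model. It rests on two assumptions that are not part of the lesson: the data frame toy is simulated (it is not malidd), and method = "lm" is used, which for a default Gaussian model gives the same straight line as method = "glm":

```r
library(ggplot2)

# Simulated data with a decreasing trend (hypothetical, not the malidd data)
set.seed(1)
toy <- data.frame(age_months = 1:50)
toy$viral_load <- 0.7 - 0.008 * toy$age_months + rnorm(50, sd = 0.05)

p <- ggplot(data = toy,
            mapping = aes(x = age_months, y = viral_load)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "darkred")

# Layer 2 is the trend line; extract the points ggplot2 uses to draw it
trend <- layer_data(p, 2)

# Fit the same model by hand and predict at the same x positions
fit <- lm(viral_load ~ age_months, data = toy)
expected <- predict(fit, newdata = data.frame(age_months = trend$x))

all.equal(trend$y, unname(expected))  # the drawn line is the fitted regression line
coef(fit)[["age_months"]]             # the slope is negative in this simulation
```

A negative slope is the numerical counterpart of the downward-sloping line in the plot.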


Let's add a third variable from the malidd dataset called vomit. This is a binary
variable that records whether or not the patient vomited. We will add the vomit variable
to the plot by mapping it to the color aesthetic. This time we will change the smoothing
method to a generalized additive model ("gam") and make some aesthetic modifications to
the line in the geom_smooth() layer.

ggplot(data = malidd,
       mapping = aes(x = age_months,
                     y = viral_load)) +
  geom_point(mapping = aes(color = factor(vomit))) +
  geom_smooth(method = "gam",
              size = 1.5,
              color = "darkgray")

## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

[Figure: scatter plot of viral_load against age_months with points colored by factor(vomit) (0 or 1) and a dark gray smoothing line]


Observe the distribution of blue points (children who vomited) compared to red points 
(children who did not vomit). The blue points mostly occur above the trend line. This 
shows that higher viral loads were not only associated with younger children, but that 
children with higher viral loads were more likely to exhibit symptoms of vomiting. 


PRACTICE (in RMD)

• Create a scatter plot with the age_months and viral_load
variables. Set the color of the points to "steelblue", the size to
2.5 mm, and the opacity to 80%. Then add a trend line with the smoothing
method "lm" (linear model). To make the trend line stand out, set
its color to "indianred3".

• Recreate the plot you made in the previous question, but this time
adapt the code to change the shape of the points to tilted
rectangles (number 23), and add the body temperature variable
(temp) by mapping it to the fill color of the points.

# Type and view your answer:
age_height_3 <- "YOUR ANSWER HERE"
age_height_3


Summary 


Scatter plots display the relationship between two numerical variables.


With medium to large datasets, you may need to play around with the different 
modifications to scatter plots we saw such as adding trend lines, changing the color, size, 
shape, fill, or opacity of the points. This tweaking is often a fun part of data visualization, 
since you'll have the chance to see different relationships emerge as you tinker with your 
plots. 


Contributors 


The following team members contributed to this lesson: 


JOY VAZ 


R Developer and Instructor, the GRAPH Network 
Loves doing science and teaching science 


ADMIN TEAM
GRAPH Courses Administration Team


The GRAPH Courses team is building epidemiological training courses to 
enhance disease surveillance and data science for public health across the 
globe 


References 


Some material in this lesson was adapted from the following sources: 


• Ismay, Chester, and Albert Y. Kim. 2022. A ModernDive into R and the Tidyverse.
https://moderndive.com/.
• Kabacoff, Rob. 2020. Data Visualization with R. https://rkabacoff.github.io/datavis/.
• Giroux-Bougard, Xavier, Maxwell Farrell, Amanda Winegardner, Etienne Low-Decarie
and Monica Granados. 2020. Workshop 3: Introduction to Data Visualisation with
{ggplot2}. http://r.qcbs.ca/workshop03/book-en/.




Lesson notes | Lines, scales, and labels 


Created by the GRAPH Courses team 


January 2023 




Learning Objectives 


1. You can create line graphs to visualize relationships between two numerical
variables with geom_line().

2. You can add points to a line graph with geom_point().

3. You can use aesthetics like color, size, and linetype to modify line
graphs.

4. You can manipulate axis scales for continuous data with scale_*_continuous()
and scale_*_log10().

5. You can add labels to a plot such as a title, subtitle, or caption with the
labs() function.


[Figure: line graph with points titled "GDP per capita in selected Asian economies, 1952-2007", subtitled "Income is measured in US dollars and is adjusted for inflation.", with one line each for China, India, and Thailand, a y-axis labeled in dollars, Year on the x-axis, and the caption "Source: www.gapminder.org/data"]


Introduction 


Line graphs are used to show relationships between two numerical variables, just like 
scatterplots. They are especially useful when the variable on the x-axis, also called the 
explanatory variable, is of a sequential nature. In other words, there is an inherent 
ordering to the variable. 


The most common examples of line graphs have some notion of time on the x-axis: 
hours, days, weeks, years, etc. Since time is sequential, we connect consecutive 
observations of the variable on the y-axis with a line. Line graphs that have some notion 
of time on the x-axis are also called time series plots. 




Packages 


# Load packages
pacman::p_load(tidyverse,
               gapminder,
               here)


The gapminder data frame 


In February 2006, a Swedish physician and data advocate named Hans Rosling gave a 
famous TED talk titled “The best stats you've ever seen” where he presented global 
economic, health, and development data compiled by the Gapminder Foundation.


[Figure: screenshot of the Gapminder website]


We can access a clean subset of this data with the R package {gapminder}, which we just 
loaded. 


# Load gapminder data frame from the gapminder package 
data(gapminder, package="gapminder") 


# Print dataframe 
gapminder 


Each row in this table corresponds to a country-year combination. For each row, we have 
6 columns: 


1. country: Country name
2. continent: Geographic region of the world
3. year: Calendar year
4. lifeExp: Average number of years a newborn child would live if current mortality
patterns were to stay the same
5. pop: Total population
6. gdpPercap: Gross domestic product per person (inflation-adjusted US dollars)


The str() function can tell us more about these variables. 


# Data structure 
str (gapminder) 


## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...


This version of the gapminder dataset contains information for 142 countries, divided
into 5 continents.

[Figure: world map titled "Gapminder world regions", subtitled "Five regions in the `continent` variable of `gapminder`", with countries colored by continent: Africa, Americas, Asia, Europe, Oceania]

# Data summary
summary(gapminder)

##         country        continent        year         lifeExp
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
##  Australia  :  12                  Max.   :2007   Max.   :82.60
##  (Other)    :1632
##       pop              gdpPercap
##  Min.   :6.001e+04   Min.   :   241.2
##  1st Qu.:2.794e+06   1st Qu.:  1202.1
##  Median :7.024e+06   Median :  3531.8
##  Mean   :2.960e+07   Mean   :  7215.3
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5
##  Max.   :1.319e+09   Max.   :113523.1


Data are recorded every 5 years from 1952 to 2007 (a total of 12 years). 


Let's say we want to visualize the relationship between time (year) and life expectancy 
(lifeExp). 


For now let’s just focus on one country - United States. First, we need to create a new 
data frame with only the data from this country. 


# Select US cases 
gap_US <- dplyr::filter(gapminder, 
country == "United States") 


gap_US 


REMINDER: The code above is covered in our course on Data Wrangling using the
{dplyr} package. Data wrangling is the process of transforming and
modifying existing data with the intent of making it more appropriate for
analysis purposes. For example, this code segment used the filter()
function to create a new data frame (gap_US) by choosing only a subset
of rows of the original gapminder data frame (only those that have "United
States" in the country column).


Line graphs via geom_line()


Now we're ready to feed the gap_US data frame to ggplot(), mapping time in years on
the horizontal x axis and life expectancy on the vertical y axis.


We can visualize this time series data by using geom_line() to create a line graph,
instead of using geom_point() like we used previously to create scatterplots:


# Simple line graph 
ggplot (data = gap_US, 
mapping = aes(x = year, 
y = lifeExp)) + 


geom_line() 


[Figure: line graph of lifeExp against year for the United States]


Much as with the ggplot() code that created the scatterplot of age and viral load with
geom_point(), let's break down this code piece-by-piece in terms of the grammar of
graphics:

Within the ggplot() function call, we specify two of the components of the grammar of
graphics as arguments:

1. The data to be the gap_US data frame by setting data = gap_US.

2. The aesthetic mapping by setting mapping = aes(x = year, y = lifeExp).
Specifically, the variable year maps to the x position aesthetic, while the variable
lifeExp maps to the y position aesthetic.

After telling R which data and aesthetic mappings we wanted to plot, we then added the
third essential component, the geometric object, using the + sign. In this case, the
geometric object was set to lines using geom_line().
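The three components can be sketched on a tiny example. The data frame toy_life below is hypothetical (made-up values, deliberately listed out of order), and layer_data() is used only to inspect what will be drawn. The sketch also illustrates a property of geom_line(): it connects observations in order of the x variable, whatever the row order of the data:

```r
library(ggplot2)

# Hypothetical yearly series (made-up values), with rows out of order
toy_life <- data.frame(
  year     = c(1962, 1952, 1957, 1967),
  life_exp = c(70.2, 68.4, 69.5, 70.8)
)

p <- ggplot(data = toy_life,                          # 1. data
            mapping = aes(x = year, y = life_exp)) +  # 2. aesthetic mapping
  geom_line()                                         # 3. geometric object

# The line is drawn through the observations sorted by x
layer_data(p)$x
```

Because of this ordering, the rows of a data frame do not need to be sorted by year before plotting a time series with geom_line().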


PRACTICE (in RMD)

Create a time series plot of the GDP per capita (gdpPercap) recorded in
the gap_US data frame by using geom_line() to create a line graph.


Fixed aesthetics in geom_line()


The color, line width and line type of the line graph can be customized making use of
color, size and linetype arguments, respectively.

We've changed the color and size of geoms in previous lessons.

Here we will add these as fixed aesthetics:

# Enhanced line graph with color and size as fixed aesthetics
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(color = "thistle3",
            size = 1.5)

[Figure: line graph of lifeExp against year for the United States, drawn with a thicker colored line]


In this lesson we introduce a new fixed aesthetic that is specific to line graphs: linetype
(or lty for short).

[Figure: sample lines for each line type]
lty = 0 or 'blank'
lty = 1 or 'solid'
lty = 2 or 'dashed'
lty = 3 or 'dotted'
lty = 4 or 'dotdash'
lty = 5 or 'longdash'
lty = 6 or 'twodash'


Line type can be specified using a name or with an integer. Valid line types can be set 
using a human readable character string: "blank", "solid", "dashed", "dotted", 
"dotdash", "longdash", and "twodash" are all understood by linetype or lty. 


# Enhanced line graph with color, size, and line type as fixed aesthetics
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(color = "thistle3",
            size = 1.5,
            linetype = "twodash")

[Figure: line graph of lifeExp against year for the United States, drawn with a thick two-dash line]

In these line graphs, it can be hard to tell exactly where the data points are. In the next
plot, we'll add points to make this clearer.


Combining compatible geoms 


As long as the geoms are compatible, we can layer them on top of one another to further 
customize a graph. 


For example, we can add points to our line graph using the + sign to add a second geom
layer with geom_point():

# Simple line graph with points
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line() +
  geom_point()

[Figure: line graph of lifeExp against year for the United States, with a point at each observation]


We can create a more attractive plot by customizing the size and color of our geoms. 


# Line graph with points and fixed aesthetics
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(size = 1.5,
            color = "lightgrey") +
  geom_point(size = 3,
             color = "steelblue")

[Figure: line graph of lifeExp against year for the United States, with a light grey line and large steel blue points]

PRACTICE (in RMD)

Building on the code above, visualize the relationship between time and
GDP per capita from the gap_US data frame.

Use both points and lines to represent the data.

Change the line type of the line and the color of the points to any valid
values of your choice.


Mapping data to multiple lines 


In the previous section, we only looked at data from one country, but what if we want to 
plot data for multiple countries and compare? 


First let's add two more countries to our data subset: 


# Create data subset for visualizing multiple categories
gap_mini <- filter(gapminder,
                   country %in% c("United States",
                                  "Australia",
                                  "Germany"))

gap_mini

If we simply reuse the same code and change the data layer, the lines are not
automatically separated by country:

# Line graph with no grouping aesthetic
ggplot(data = gap_mini,
       mapping = aes(y = lifeExp,
                     x = year)) +
  geom_line() +
  geom_point()

[Figure: line graph of lifeExp against year in which the observations from all three countries are joined into a single zigzag line]

This is not a very helpful plot for comparing trends between groups.

To tell ggplot() to map the data from each country separately, we can use the group
argument as an aesthetic mapping:


# Line graph with grouping by a categorical variable
ggplot(data = gap_mini,
       mapping = aes(y = lifeExp,
                     x = year,
                     group = country)) +
  geom_line() +
  geom_point()

[Figure: line graph of lifeExp against year with three separate lines, one per country]


Now that the data is grouped by country, we have 3 separate lines - one for each level of 


the country variable. 
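The effect of the group aesthetic can be checked on a small example. The data frame toy_countries below is hypothetical (made-up values for three unnamed countries), and layer_data() is used only to count how many separate lines {ggplot2} will draw:

```r
library(ggplot2)

# Hypothetical data: three countries observed in three years (made-up values)
toy_countries <- data.frame(
  country  = rep(c("A", "B", "C"), each = 3),
  year     = rep(c(1952, 1957, 1962), times = 3),
  life_exp = c(68, 69, 70, 69, 70, 71, 67, 69, 70)
)

# No grouping: every observation belongs to one and the same line
p_nogroup <- ggplot(data = toy_countries,
                    mapping = aes(x = year, y = life_exp)) +
  geom_line()

# Grouping by country: one line per country
p_group <- ggplot(data = toy_countries,
                  mapping = aes(x = year, y = life_exp, group = country)) +
  geom_line()

length(unique(layer_data(p_nogroup)$group))  # number of lines without grouping
length(unique(layer_data(p_group)$group))    # number of lines with grouping
```

Mapping a categorical variable to color or linetype also splits the data into groups, which is why an explicit group mapping is not always needed.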


We can also apply fixed aesthetics to the geometric layers. 


# Applying fixed aesthetics to multiple lines
ggplot(data = gap_mini,
       mapping = aes(y = lifeExp,
                     x = year,
                     group = country)) +
  geom_line(linetype = "longdash", # set line type
            color = "tomato",      # set line color
            size = 1) +            # set line size
  geom_point(size = 2)             # set point size

[Figure: line graph of lifeExp against year with three long-dashed tomato-colored lines, one per country, and a point at each observation]


In the graphs above, line types, colors and sizes are the same for the three groups. 


This doesn't tell us which is which though. We should add an aesthetic mapping that can 
help us identify which line belongs to which country, like color or line type. 


# Map country to color
ggplot(data = gap_mini,
       mapping = aes(y = lifeExp,
                     x = year,
                     group = country,
                     color = country)) +
  geom_line(size = 1) +
  geom_point(size = 2)

[Figure: line graph of lifeExp against year with lines and points colored by country (Australia, Germany, United States)]


Aesthetic mappings specified within ggplot () function call are passed down to 
subsequent layers. 


Instead of grouping by country, we can also group by continent: 


# Map continent to color, line type, and shape
ggplot(data = gap_mini,
       mapping = aes(x = year,
                     y = lifeExp,
                     color = continent,
                     lty = continent,
                     shape = continent)) +
  geom_line(size = 1) +
  geom_point(size = 2)

[Figure: line graph of lifeExp against year with color, line type, and point shape varying by continent (Americas, Europe, Oceania)]

When given multiple mappings and geoms, {ggplot2} can discern which mappings apply
to which geoms.

Here color was inherited by both points and lines, but lty was ignored by
geom_point() and shape was ignored by geom_line(), since they don't apply.


Challenge 


Mappings can either go in the ggplot() function or in a geom_*() layer.

CHALLENGE

For example, aesthetic mappings can go in geom_line() and will only be
applied to that layer:

ggplot(data = gap_mini,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(size = 1, mapping = aes(color = continent)) +
  geom_point(mapping = aes(shape = country,
                           size = pop))

[Figure: line graph of lifeExp against year with lines colored by continent, point shapes by country, and point sizes by pop]

Try adding mapping = aes() in geom_point() and map continent to
any valid aesthetic!
PRACTICE (in RMD)

Using the gap_mini data frame, create a population growth chart with
these aesthetic mappings:

[Figure: line graph of pop against year with one line per country (Australia, Germany, United States), distinguished in a country legend]

PRACTICE (in RMD)

Next, add a layer of points to the previous plot, and add the required
aesthetic mappings to produce a plot that looks like this:

[Figure: the same line graph of pop against year with points added; the legends show continent (Americas, Europe, Oceania) and country (Australia, Germany, United States)]

Don't worry about any fixed aesthetics, just make sure the mapping of
data variables is the same.


Modifying continuous x & y scales 


{ggplot2} automatically scales variables to an aesthetic mapping according to type of 
variable it’s given. 


# Automatic scaling for x, y, and color
ggplot(data = gap_mini,
       mapping = aes(x = year,
                     y = lifeExp,
                     color = country)) +
  geom_line(size = 1)

[Figure: line graph of lifeExp against year with lines colored by country and default axis scales]


In some cases we might want to transform the axis scaling for better visualization. We
can customize these scales with the scale_*() family of functions.

[Figure: {ggplot2} code template showing the required components and the optional functions (such as stat, position, coordinate, and theme functions) that have defaults]

scale_x_continuous() and scale_y_continuous() are the default scale functions for
continuous x and y aesthetics.


[Figure: excerpt from the {ggplot2} cheat sheet listing general purpose scales (scale_*_continuous(), scale_*_discrete(), scale_*_identity(), scale_*_manual(), scale_*_date(), scale_*_datetime()), x & y location scales (scale_x_log10(), scale_x_reverse(), scale_x_sqrt()), color and fill scales, and shape and size scales]


Scale breaks


Let's create a new subset of countries from gapminder, and this time we will plot changes
in GDP over time.

# Data subset to include India, China, and Thailand
gap_mini2 <- filter(gapminder,
                    country %in% c("India",
                                   "China",
                                   "Thailand"))

gap_mini2

Here we will change the y-axis mapping from lifeExp to gdpPercap:

ggplot(data = gap_mini2,
       mapping = aes(x = year,
                     y = gdpPercap,
                     group = country,
                     color = country)) +
  geom_line(size = 0.75)

[Figure: line graph of gdpPercap against year with lines colored by country (China, India, Thailand) and x-axis labels at 1950, 1960, ..., 2000]


The x-axis labels for year don't match up with the years in the dataset.

gap_mini2$year %>% unique()

## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

We can specify exactly where to label the axis by providing a numeric vector.

# You can manually enter scale breaks (don't do this)
c(1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007)

## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007


# It's better to create the vector with seq() 
seq(from = 1952, to = 2007, by = 5) 


## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 
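As a quick check that a generated sequence really covers the years in the data, the sketch below derives the breaks from the data itself. The vector years_in_data is a hypothetical stand-in for gap_mini2$year (12 survey years repeated for 3 countries), and the 5-year step is taken from the text above:

```r
# Hypothetical stand-in for gap_mini2$year: 12 survey years for 3 countries
years_in_data <- rep(c(1952, 1957, 1962, 1967, 1972, 1977,
                       1982, 1987, 1992, 1997, 2002, 2007),
                     times = 3)

# Build the breaks from the range of the data instead of typing each year
year_breaks <- seq(from = min(years_in_data),
                   to = max(years_in_data),
                   by = 5)

# The generated breaks match the distinct years in the data
all(year_breaks == sort(unique(years_in_data)))
length(year_breaks)
```

Deriving the start and end from the data means the breaks stay correct if the subset of years changes.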


Use scale_x_continuous() to make the axis breaks match up with the dataset:

# Customize x-axis breaks with `scale_x_continuous(breaks = VECTOR)`
ggplot(data = gap_mini2,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 1) +
  scale_x_continuous(breaks = seq(from = 1952, to = 2007, by = 5)) +
  geom_point()

[Figure: line graph with points of gdpPercap against year, colored by country (China, India, Thailand), with x-axis breaks at 1952, 1957, ..., 2007]


Store scale break values as an R object for easier reference: 


# Store numeric vector to a named object
gap_years <- seq(from = 1952, to = 2007, by = 5)

# Replace seq() code with named vector
ggplot(data = gap_mini2,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 1) +
  scale_x_continuous(breaks = gap_years)

[Figure: line graph of gdpPercap against year, colored by country (China, India, Thailand), with x-axis breaks at 1952, 1957, ..., 2007]


PRACTICE (in RMD)

We can customize scale breaks on continuous y-axis values with
scale_y_continuous().

Copy the code from the last example, and add scale_y_continuous()
to add the following y-axis breaks:

[Figure: line graph of gdpPercap against year, colored by country (China, India, Thailand), with y-axis breaks every 1000 from 1000 to 7000 and x-axis breaks at 1952, 1957, ..., 2007]


Logarithmic scaling 


In the last two mini sets, I chose three countries that had a similar range of GDP or life
expectancy for good scaling and readability so that we can make out these changes.


But if we add a country to the group that significantly differs, default scaling is not so 
great. 


We'll look at an example plot where you may want to rescale the axes from linear to a log 
scale. 


Let’s add New Zealand to the previous set of countries and create gap_mini3:


# Data subset to include India, China, Thailand, and New Zealand
gap_mini3 <- filter(gapminder,
                    country %in% c("India",
                                   "China",
                                   "Thailand",
                                   "New Zealand"))


gap_mini3 


Now we will recreate the plot of GDP over time with the new data subset: 


ggplot(data = gap_mini3,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 0.75) +
  scale_x_continuous(breaks = gap_years)


[Figure: line graph of gdpPercap against year for China, India, New Zealand, and Thailand, on a linear y-axis running from 0 to 25000.]


The curves for India and China show an exponential increase in GDP per capita. However,
the y-axis values for these two countries are much lower than those of New Zealand, so the
lines are a bit squashed together. This makes the data hard to read. Additionally, the large
empty area in the middle is not a great use of plot space.
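
To put numbers on how far apart these countries sit, one option is to summarize the range of gdpPercap for each country. This is only a sketch: it assumes {dplyr} and {gapminder} are installed, and `demo_mini3` and `gdp_ranges` are made-up names chosen so that the lesson's gap_mini3 is left untouched.

```r
library(dplyr)
library(gapminder)

# Hypothetical copy of the four-country subset
demo_mini3 <- filter(gapminder,
                     country %in% c("India", "China", "Thailand", "New Zealand"))

# Lowest and highest GDP per capita recorded for each country
gdp_ranges <- summarise(group_by(demo_mini3, country),
                        min_gdp = min(gdpPercap),
                        max_gdp = max(gdpPercap))

gdp_ranges
```

When the maximum for one country is many times larger than for the others, a linear axis has to stretch to fit it, which compresses the remaining lines.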


We can address this by log-transforming the y-axis using scale_y_log10(), which
log-scales the y-axis (as the name suggests). We will add this function as a new layer after a
+ sign, as usual:


# Add `scale_y_log10()`
ggplot(data = gap_mini3,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 0.75) +
  scale_x_continuous(breaks = gap_years) +
  scale_y_log10()


[Figure: the same line graph for China, India, New Zealand, and Thailand with a log-scaled y-axis; labeled breaks include 1000, 10000, and 30000.]


Now the y-axis values are rescaled, and the scale break labels tell us that the scale is nonlinear.
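
A little arithmetic shows what "nonlinear" means here. On a log10 scale, the position of a value on the axis is its base-10 logarithm, so equal ratios (not equal differences) take up equal distances. The sketch below uses break values like those labeled on the plot; `log_breaks` is a made-up name.

```r
# Break values similar to the labels on the log-scaled y-axis
log_breaks <- c(1000, 3000, 10000, 30000)

# Position of each break after the log10 transformation
log10(log_breaks)

# A tenfold increase covers the same distance wherever it starts
log10(10000) - log10(1000)
log10(30000) - log10(3000)
```

This is why the jump from 1000 to 10000 looks as tall as the jump from 3000 to 30000, even though the second is three times larger in dollars.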


We can add a layer of points to make this clearer: 


ggplot(data = gap_mini3,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 1) +
  scale_x_continuous(breaks = gap_years) +
  scale_y_log10() +
  geom_point()


[Figure: the log-scaled line graph of gdpPercap against year with points added, for China, India, New Zealand, and Thailand; y-axis breaks at 1000, 3000, 10000, and 30000.]


PRACTICE (in RMD)

First, subset gapminder to only the rows containing data for Uganda:

Now, use gap_Uganda to create a time series plot of population (pop)
over time (year). Transform the y-axis to a log scale, set the scale
breaks to gap_years, and change the line color to forestgreen and the size
to 1mm.


Next, we can change the text of the axis labels to be more descriptive, as well as add titles, 
subtitles, and other informative text to the plot. 


Labeling with labs()


You can add labels to a plot with the labs() function. Arguments we can specify with the
labs() function include:


• title: Change or add a title
• subtitle: Add a subtitle below the title
• x: Rename x-axis
• y: Rename y-axis
• caption: Add caption below the graph
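
As a compact illustration of where each of these arguments ends up, here is a sketch on a small made-up data frame. It assumes {ggplot2} is installed; `toy_data`, its values, and `p_labels` are invented for illustration and are not part of gapminder.

```r
library(ggplot2)

# Made-up data, for illustration only
toy_data <- data.frame(year = c(2000, 2005, 2010),
                       cases = c(12, 18, 9))

# Each label text describes where that label appears on the plot
p_labels <- ggplot(data = toy_data,
                   mapping = aes(x = year, y = cases)) +
  geom_line() +
  geom_point() +
  labs(title = "Title goes above the plot",
       subtitle = "Subtitle goes below the title",
       x = "Label for the x-axis",
       y = "Label for the y-axis",
       caption = "Caption goes below the plot")

p_labels
```

Any argument left out of labs() keeps its default, so labels can be added one at a time.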


Let’s start with this plot and start adding labels to it: 


# Time series plot of life expectancy in the United States
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(size = 1.5,
            color = "lightgrey") +
  geom_point(size = 3,
             color = "steelblue") +
  scale_x_continuous(breaks = gap_years)


[Figure: time series of lifeExp against year for the United States, drawn as a light grey line with steel blue points; the y-axis runs from 68 to 78 and the x-axis from 1952 to 2007.]


We add labs() to our code using a + sign.
First we will add the x and y arguments to labs(), and change the axis titles from the
default (variable name) to something more informative.


# Rename axis titles
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(size = 1.5,
            color = "lightgrey") +
  geom_point(size = 3,
             color = "steelblue") +
  scale_x_continuous(breaks = gap_years) +
  labs(x = "Year",
       y = "Life Expectancy (years)")

[Figure: the life expectancy time series for the United States, with the axis titles "Year" and "Life Expectancy (years)".]


Next we supply a character string to the title argument to add large text above the plot.


# Add main title: "Lifespan increases over time"
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(size = 1.5,
            color = "lightgrey") +
  geom_point(size = 3,
             color = "steelblue") +
  scale_x_continuous(breaks = gap_years) +
  labs(x = "Year",
       y = "Life Expectancy (years)",
       title = "Lifespan increases over time")


[Figure: the life expectancy time series for the United States, with the main title "Lifespan increases over time" above the plot.]


The subtitle argument adds smaller text below the main title. 


# Add subtitle with location and time frame
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(size = 1.5,
            color = "lightgrey") +
  geom_point(size = 3,
             color = "steelblue") +
  scale_x_continuous(breaks = gap_years) +
  labs(x = "Year",
       y = "Life Expectancy (years)",
       title = "Life expectancy changes over time",
       subtitle = "United States (1952-2007)")


[Figure: the life expectancy time series for the United States, with the title "Life expectancy changes over time" and the subtitle "United States (1952-2007)".]


Finally, we can supply the caption argument to add small text to the bottom-right corner 
below the plot. 


# Add caption with data source: "Source: www.gapminder.org/data"
ggplot(data = gap_US,
       mapping = aes(x = year,
                     y = lifeExp)) +
  geom_line(size = 1.5,
            color = "lightgrey") +
  geom_point(size = 3,
             color = "steelblue") +
  scale_x_continuous(breaks = gap_years) +
  labs(x = "Year",
       y = "Life Expectancy (years)",
       title = "Life expectancy changes over time",
       subtitle = "United States (1952-2007)",
       caption = "Source: http://www.gapminder.org/data/")


[Figure: the life expectancy time series for the United States, with title, subtitle, and the caption "Source: http://www.gapminder.org/data/" in the bottom-right corner.]


CHALLENGE


When you use an aesthetic mapping (e.g., color, size), {ggplot2}
automatically scales the given aesthetic to match the data and adds a
legend.


Here is an updated version of the gap_mini2 plot we made before. We
are changing the color of points and lines by setting aes(color = country)
in ggplot(). Then the size of points is scaled to the pop variable. Note
that labs() is used to change the title, subtitle, and axis labels.


ggplot(data = gap_mini2,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 1) +
  geom_point(mapping = aes(size = pop),
             alpha = 0.5) +
  geom_point() +
  scale_x_continuous(breaks = gap_years) +
  scale_y_log10() +
  labs(x = "Year",
       y = "Income per person",
       title = "GDP per capita in selected Asian economies, 1952-2007",
       subtitle = "Income is measured in US dollars and is adjusted for inflation.")


[Figure: log-scaled line graph titled "GDP per capita in selected Asian economies, 1952-2007", with the subtitle "Income is measured in US dollars and is adjusted for inflation.", axis titles "Year" and "Income per person", and two legends titled pop and country.]


The default title of a legend or key is the name of the data variable it
corresponds to. Here the color legend is titled country, and the size
legend is titled pop.


We can also edit these in labs() by setting AES_NAME = "CUSTOM TITLE".


ggplot(data = gap_mini2,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 1) +
  geom_point(mapping = aes(size = pop),
             alpha = 0.5) +
  geom_point() +
  scale_x_continuous(breaks = gap_years) +
  scale_y_log10() +
  labs(x = "Year",
       y = "Income per person",
       title = "GDP per capita in selected Asian economies, 1952-2007",
       subtitle = "Income is measured in US dollars and is adjusted for inflation.",
       color = "Country", size = "Population")


[Figure: the same plot of GDP per capita in selected Asian economies, with the legends retitled "Population" and "Country".]


The same syntax can be used to edit legend titles for other aesthetic
mappings. A common mistake is to use the variable name instead of the
aesthetic name in labs(), so watch out for that!
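
To make that mistake concrete, the sketch below builds the same plot twice: once with the aesthetic names in labs(), and once with the variable names. It assumes {ggplot2}, {dplyr}, and {gapminder} are installed; `demo_mini`, `base_plot`, `p_right`, and `p_wrong` are made-up names.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Hypothetical stand-in for the three-country subset
demo_mini <- filter(gapminder,
                    country %in% c("India", "China", "Thailand"))

base_plot <- ggplot(data = demo_mini,
                    mapping = aes(x = year,
                                  y = gdpPercap,
                                  color = country)) +
  geom_line() +
  geom_point(mapping = aes(size = pop), alpha = 0.5)

# Correct: the arguments are the aesthetic names (color, size)
p_right <- base_plot + labs(color = "Country", size = "Population")

# Mistake: the arguments are the variable names (country, pop),
# so the legend titles stay at their defaults
p_wrong <- base_plot + labs(country = "Country", pop = "Population")

p_right
p_wrong
```

Comparing the two plots side by side is a quick way to check which names labs() expects.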


PRACTICE (in RMD)

Create a time series plot comparing the trends in GDP per capita from
1952-2007 for three countries in the gapminder data frame.


First, subset the data to three countries of your choice:

Use my_gap_mini to create a plot with the following attributes:


• Add points to the line graph
• Color the lines and points by country
• Increase the width of lines to 1mm and the size of points to 2mm
• Make the lines 50% transparent


Finally, add the following labels to your plot:


• Title: “Health & wealth of nations”
• Axis titles: “Longevity” and “Year”
• Capitalize legend title


(Note: subtitle requirement has been removed.)


Preview: Themes 


In the next lesson, you will learn how to use theme functions. 


# Use theme_minimal()
ggplot(data = gap_mini2,
       mapping = aes(x = year,
                     y = gdpPercap,
                     color = country)) +
  geom_line(size = 1, alpha = 0.5) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = gap_years) +
  scale_y_log10() +
  labs(x = "Year",
       y = "Income per person",
       title = "GDP per capita in selected Asian economies, 1952-2007",
       subtitle = "Income is measured in US dollars and is adjusted for inflation.",
       caption = "Source: www.gapminder.org/data") +
  theme_minimal()


[Figure: the plot of GDP per capita in selected Asian economies drawn with theme_minimal(), with axis titles "Year" and "Income per person", a country legend (China, India, Thailand), and the caption "Source: www.gapminder.org/data".]


Wrap up 


Line graphs, just like scatterplots, display the relationship between two numerical
variables. When one of the two variables represents time, a line graph can be a more
effective way of displaying that relationship. Line graphs are therefore preferred
over scatterplots when the variable on the x-axis (i.e., the explanatory variable) has an
inherent ordering, such as some notion of time, like the year variable of gapminder.


We can change scale breaks and transform scales to make plots easier to read, and label 
them to add more information. 


Hope you found this lesson helpful! 


Contributors 


The following team members contributed to this lesson: 




JOY VAZ 


R Developer and Instructor, the GRAPH Network 
Loves doing science and teaching science 


ADMIN TEAM


GRAPH Courses Administration Team

The GRAPH Courses team is building epidemiological training courses to
enhance disease surveillance and data science for public health across the
globe.


References 


Some material in this lesson was adapted from the following sources: 


• Ismay, Chester, and Albert Y. Kim. 2022. A ModernDive into R and the Tidyverse.
https://moderndive.com/.

• Kabacoff, Rob. 2020. Data Visualization with R. https://rkabacoff.github.io/datavis/.

• https://www.rebeccabarter.com/blog/2017-11-17-ggplot2_tutorial/


This work is licensed under the Creative Commons Attribution Share Alike license. 




